Preface


The material in this book is based on my several years’ experience in 
construction and evaluation of examinations, first as a member of the 
Board of Examinations of the University of Chicago, later as director 
of a war research project developing aptitude and achievement tests 
for the Bureau of Naval Personnel, and at present as research adviser 
for the Educational Testing Service. Collection and presentation of the 
material have been furthered for me by teaching courses in statistics 
and test theory at The University of Chicago and now at Princeton 
University. 

During this time I have become aware of the necessity for a firm 
grounding in test theory for work in test development. When this book 
was begun the material on test theory was available in numerous articles 
scattered through the literature and in books written some time ago, and 
therefore not presenting recent developments. It seemed desirable to 
me to bring the technical developments in test theory of the last fifty 
years together in one readily available source. 

Although this book is written primarily for those working in test 
development, it is interesting to note that the techniques presented here 
are applicable in many fields other than test construction. Many of 
the difficulties that have been encountered and solved in the testing 
field also confront workers in other areas, such as measurement of attitudes
or opinions, appraisal of personality, and clinical diagnosis. For
example, in each of these fields the error of measurement is large compared
to the differences that the scientist is seeking; hence the methods
of dealing with and reducing error of measurement developed in connection
with testing are pertinent. Methods of adjusting results to
take account of differences in group variability have been developed in
testing, and they are helpful in arriving at appropriate conclusions whenever
the apparent results of an experiment are affected by group variability.
If measurements in any field are to merit confidence, the scientist
must demonstrate that they are repeatable. Thus the theoretical
and experimental work on reliability as developed for tests may be 
utilized in numerous other areas, as, for example, clinical diagnosis or 
personality appraisal. In any situation where a single decision or a 
diagnosis is made on the basis of several different types of evidence, the
material on weighting methods is applicable. 

As the writing of the book progressed, there were several developments
that seemed especially interesting to me from a technical point
of view. The basic formulas of test theory are derived from two different 
sets of assumptions. In Chapter 2 the derivation is based on a definition 
of random error, with true score being simply a difference between 
observed and error score. In Chapter 3 the same formulas are derived 
from a definition of true score, error being the difference between observed
and true scores. The treatment of the effects of test length and
of group heterogeneity in terms of invariants may enable workers in the 
field of statistical theory to furnish test technicians with appropriate 
statistical criteria for this invariance. The distinction between explicit 
and incidental selection and the development of the theory of incidental 
selection for the multivariate case may facilitate the proper use of corrections
for restriction of range. It was especially interesting to me to
work on the beginning of a rationale for power tests that are partly 
speeded, as given in Chapter 17. This theory should help in determining 
test time limits and their effect on estimates of reliability. The initiation
of a systematic mathematical theory for item analysis, as indicated
in Chapter 21, should help in constructing tests that are suited for 
different specific purposes. I hope that the discussion of weighting 
methods presented in Chapter 20 will assist in clarifying problems in 
this field. 

Illustrative computing diagrams are given for those formulas simple 
enough to be changed into a linear form. In general, if we work with 
curved lines on a computing diagram, the labor of computing a large 
number of points for each line is prohibitive for the individual worker. 
If the diagram can be made up of straight lines, it is necessary to compute 
only one or two points for each line, and a graph of the proper size 
and scaled with values appropriate for any particular operation can be 
set up in a few hours. To be used for actual computing, such diagrams 
must be much larger than the illustrative ones given here. A minimum 
size of 8 by 10 inches and a large size of 20 by 30 inches will usually be 
found suitable. I hope that the diagrams given here will illustrate the 
principles of construction so well that any worker who has occasion to 
use any one of these formulas a great number of times can construct 
a larger and more detailed diagram with a scale appropriate to his 
own problem. 

The major part of this book is designed for readers with the following 
preparation: 

1. A knowledge of elementary algebra, including such topics as the
binomial expansion, the solution of simultaneous linear equations, and 
the use of tables of logarithms. 

2. Some familiarity with analytical geometry, emphasizing primarily 
the equation of the straight line, although some use is made of the 
equations for the circle, ellipse, hyperbola, and parabola. 

3. A knowledge of elementary statistics, including such topics as the 
computation and interpretation of means, standard deviations, correlations,
errors of estimate, and the constants of the equation of the
regression line. It is assumed that the students know how to make and 
to interpret frequency diagrams of various sorts, including the histogram,
frequency polygon, normal curve, cumulative frequency curve,
and the correlation scatter diagram. Familiarity with tables of the 
normal curve and with significance tests is also assumed. 

A brief résumé of the major formulas from algebra, analytical geometry,
and statistics that are assumed in this book is given in Appendix A.

In order to include a more complete coverage of some major topics 
in test theory, certain exceptions were made to the foregoing requirements.

1. Chapter 5, sections 3 and 4, assumes an elementary knowledge of 
analysis of variance, including a first-order interaction. 

2. Chapter 10, section 4, assumes a knowledge of the least squares 
procedure for fitting a second-degree curve. 

3. Chapter 13 assumes an understanding of the elements of matrix 
theory. 

4. Chapter 14 assumes the ability to use determinants for those cases 
involving more than three variables. For the case of three tests, the 
formulas of Chapter 14 are written without the use of determinantal 
notation. 

5. Chapter 20, sections 2 and 4, assumes a knowledge of maxima and 
minima in calculus and of the solution of simultaneous linear equations 
by determinants. 

6. Chapter 20, sections 12 and 13, assumes an understanding of the 
use of the Lagrange multiplier in solving a matrix equation. 

The rest of the material in the book has been written so that the parts 
requiring advanced preparation may be omitted without disturbing the 
continuity of the material. 

My suggestion is that students with the minimum preparation in 
elementary algebra and statistics should omit the following material: 
Chapter 5 (sections 3 and 4), Chapter 10 (section 4), Chapter 13, and
Chapter 20 (sections 1 to 4, 12, and 13). If lack of time makes it necessary
to curtail assignments still further, Chapters 3, 7, 14, and 17 may
be omitted also without disturbing the continuity of the treatment. Advanced
students whose preparation includes calculus, matrix theory,
and analysis of variance may not need to study Chapters 2, 6, and 11. 

Harold Gulliksen 

Princeton, New Jersey 
August, 1950 

Symbols


Although certain chapters required a special set of symbols, generally 
the following notation is used in this book. 

X, Y, Z, or W denotes the gross score or raw score on a test. 
i and j are subscripts designating persons. 
g and h are subscripts designating tests. 

x, y, z, or w denotes deviation scores, the gross score minus the mean. 
N denotes total number of persons in a group. 
n denotes the number of persons in a subgroup. 

K denotes the number of items in a test, or the number of tests in 
a battery. 

k denotes the number of items in a subtest. 

T equals the gross true score. 

t equals the deviation true score, the gross true score minus the mean 
of these scores. 

E equals the gross error score (random error). 
e equals the deviation error score (since the average error is zero, 
E = e). 

M, m, X, P equal the sample mean. 

S, s, X, ? equal the sample standard deviation. 
r and R equal a sample correlation coefficient. 

μ equals the population mean.

σ equals the standard deviation for the population.
ρ equals the correlation for the population.
f equals the covariance for the population. 

Introduction


It is interesting to note that during the 1890’s several attempts were 
made in this country to utilize the new methods of measurement of 
individual differences in order to predict college grades. J. McKeen 
Cattell and his student Clark Wissler tried a large number of psychological
tests and correlated them with grades in various subjects at
Columbia University; see Cattell (1890), Cattell and Farrand (1896), 
and Wissler (1901). The correlations between the psychological tests 
and the grades were around zero, the highest correlation being .19. A 
similar attempt by Gilbert (1894), at Yale, produced similarly disappointing
results.

Scientific confidence in the possibilities of measuring individual differences
revived in this country with the introduction of the Binet
scale and the quantitative techniques developed by Karl Pearson and 
Charles Spearman at the beginning of the twentieth century. Nearly 
all the basic formulas that are particularly useful in test theory are 
found in Spearman’s early papers; see Spearman (1904a), (1904b),
(1907), (1910), and (1913). Since then development of both the theory
and the practical aspects of aptitude and achievement testing has progressed
rapidly. Aptitude and achievement tests are widely used in
education and in industry. 

Since 1900 great progress has been made toward a unified quantitative
theory that describes the behavior of test items and test scores
under various conditions. This mathematical rationale applicable to 
mental tests should not be confused with statistics. A good foundation 
in elementary statistics and elementary mathematics is a prerequisite 
for work in the theory of mental tests. In addition, as the theory of 
mental tests is developed, the necessity arises for various statistical 
criteria to determine whether or not a given set of test data agrees 
with the theory, within reasonable sampling limits. The theory, however,
must first be developed without consideration of sampling errors,
and then the statistical problems in conjunction with sampling can be 
considered. 
The Theory of Mental Tests [Chap. 1

This book deals with the mathematical theory and statistical methods 
used in interpreting test results. There are numerous non-quantitative 
problems involved in constructing aptitude or achievement tests that 
are not considered here. Non-quantitative problems such as choice of 
item types or matching the examination to the objectives of a curriculum
are discussed in the University of Chicago Manual of Examination
Methods (1937); Englehart (1942); Hawkes, Lindquist, and Mann
(1936); Hull (1928); Orleans (1937); Ruch (1929); and others. Therefore,
no attempt is made here to familiarize the student with the various
psychological and educational tests now available or with the scope of 
the many testing programs. Such material is surveyed in yearbooks by 
Buros (1936), (1937), (1938), (1941), and (1949); Hildreth (1939); Lee 
and Symonds (1934); the National Society for the Study of Education, 
the 17th Yearbook (1918); Ruger (1918); Whipple (1914), (1915); Freeman
(1939); Mursell (1947); Ross (1947); Goodenough (1949); Cronbach
(1949); and other general textbooks listed in the bibliography.

In constructing tests, analyzing and interpreting the results, there are 
five major types of problems: 

1. Writing and selecting the test items. 

2. Assigning a score to each person. 

3. Determining the accuracy (reliability or error of measurement) of 
the test scores. 

4. Determining the predictive value of the test scores (validity or 
error of estimate). 

5. Comparing the results with those obtained using other tests or 
other groups of subjects. In making these comparisons, it is necessary
to consider the effect of test length and group heterogeneity
on the various measures of the accuracy and the predictive value 
of the test scores. 

In dealing with any given test these problems would arise chronologically
in the order in which they are given above. However, the theory
of the selection of test items depends upon comparing them with some 
test score or scores; therefore it is convenient to consider first the theory 
dealing with the accuracy of these test scores. Similarly the evaluation 
of experimental methods of determining reliability and the discussion 
of practical methods of setting up parallel tests depend upon a theoretical
concept of reliability and of parallel tests. Therefore, instead of
beginning with practical problems of item selection, experimental
methods of determining reliability, or of setting up parallel tests, we
shall begin with the theoretical constructs.

An ideal model will be set up giving the measures of accuracy of test 
scores and the theoretical effects of changes in test length and in group
heterogeneity. The theory of these changes will be derived from assumptions
regarding parallel tests and selection procedures, without
inquiring very closely into the experimental methods that are appropriate
for realizing these assumptions. Beginning with Chapter 14,
various practical problems relating to the construction of parallel tests,
criteria for parallel tests, experimental methods of determining reliability,
etc., will be considered. It is felt that postponing such practical
considerations until the latter part of the book has the advantage of 
giving the student a firm foundation in theory first. Then on the basis 
of this familiarity with the ideal situation, various practical procedures 
can be evaluated in terms of the closeness with which they approximate 
the theoretically perfect method. To consider practical experimental 
procedures without such a grounding in the theoretical foundation
leaves these procedures as approximations to something that is not yet 
clearly stated or understood. 

The basic theoretical material on accuracy of test scores is presented 
in Chapters 2 through 5, which deal with the topics of test reliability 
and the error of measurement. The effect of test length upon reliability 
and validity is considered in Chapters 6 through 9, and the effect of 
group heterogeneity on measures of accuracy in Chapters 10 through 13. 
In these chapters we give only a theoretical definition of parallel tests, 
and we define reliability as the correlation between two parallel forms. 
This simplified presentation of the concept of parallel tests and of reliability
makes it possible to concentrate on the theory of test reliability
and test validity before taking up the short-cuts and approximations 
that are frequently used in actual practice. Practical problems of 
criteria for parallel tests are given in Chapter 14, and experimental 
methods of determining reliability when a parallel form is not used are 
considered in Chapters 15 and 16. Methods of scoring, scaling, and 
equating tests are considered in Chapters 18 and 19. Problems dealing 
with batteries of tests are considered in Chapter 20, and problems of 
item selection in Chapter 21. 

1. Introduction

We shall begin by assuming the conventional objective testing procedure
in which the person is presented with a number of items to be
answered. Each answer is scored as correct or incorrect, and a simple
or a weighted sum of the correct answers is taken as the test score.
The various procedures for determining which items to use and the 
best weighting methods will be considered later. For the present we 
assume that the numerical score is based on a count, one or more 
points for each correct answer and zero for each incorrect answer, and 
we turn our attention to the determination of the accuracy of this score. 

When psychological measurement is compared with the type of 
measurement found in physics, many points of similarity and difference 
are found. One of the very important differences is that the error of 
measurement in most psychological work is very much greater than it 
is in physics. For example, Jackson and Ferguson (1941) resorted to 
specially constructed “rubber rulers” in order to reduce the reliability 
of length measurements to values appreciably below .99. The estimation
of the error in a set of test scores and the differentiation between
“error” and “true” score on a test are central problems in mental 
measurement. 

2. The basic assumption of test theory 

It is necessary to make some assumption regarding the relationship 
between true scores and error scores. Let us define three basic symbols. 

$X_i$ = the score of the $i$th person on the test under consideration.

$T_i$ = the true score of the $i$th person on this test.

$E_i$ = the error component for the same person.

In defining these symbols it is assumed that the gross score has two 
components. One of these components (T) represents the actual ability 
of the person, a quantity that will be relatively stable from test to test 
as long as the tests are measuring the same thing. The other component
(E) is an error. It is due to the various factors that may cause a
person sometimes to answer correctly an item that he does not know,
and sometimes to answer incorrectly an item that he does know. So
far, it will be observed, there is no proposition subject to any experimental
check. We have simply said that there is some number T that
would be the person’s correct score, and that the obtained score (X) 
does not necessarily equal T. 

It is possible to make many different assumptions regarding the relationship
between the three terms X, T, and E. The one made in test
theory is the simplest possible assumption, namely, that

(1) $X_i = T_i + E_i$ or $E_i = X_i - T_i$.

This equation may be regarded as an assumption that states the relationship
between true and error score; or it may be regarded as an
equation defining what we are going to mean by error. In other words,
once we accept the concept of a true score existing that differs from the
observed score, we may then say that the difference between these two 
scores is going to be called error. 

3. The problem of determining characteristics of true and error score

It may be noted that so far we have but one equation with two unknowns
(T and E). It cannot be solved to determine the values of T
and of E for the person. If we test additional people, the situation does 
not become any more determinate. Each new score brings one new 
equation, like equation (1), and also two new unknowns. However, we 
may note that with measures on many persons we would have three 
frequency distributions: the distribution of X's, of T's, and of E's. Let
us investigate to see if we can learn something about the characteristics
or parameters of these frequency distributions. Can we determine or
make reasonable assumptions about the means, the standard deviations,
or the intercorrelations of these three distributions?

There are two equivalent approaches to the problem of determining 
the characteristics of the distributions of T and E. 

1. A definition of error score is given, and the true score is regarded 
simply as the difference between the observed score and the error score. 
Intuitively this approach is somewhat unsatisfying, since the main 
attention is concentrated upon the error part, which is to be ignored, 
and the important component (true score) is just what happens to be 
left over. However, the basic equations can be derived quite simply
from this assumption. 

2. The other approach is to define the true score and then to let the
difference between observed and true score be called error. This approach
is probably intuitively more satisfying, since attention is first
concentrated upon getting a reasonable true score, and the error is a
remainder. However, this approach results in much more difficult
equations. Since the first approach is the simpler to follow, let us
consider it next.

4. Definition of random errors 

In this approach to the problem of determining the means, the standard
deviations, and the intercorrelations of true, observed, and error
scores, we define more carefully just what is meant by error. In dealing 
with errors of measurement, it is necessary to recognize that there are 
two basic types of error. They are termed random or chance errors, on 
the one hand, and constant or systematic errors, on the other. 

If measurements are consistently larger than they should be or are 
consistently smaller than they should be we have what is termed “constant
error.” For example, if a tape measure has stretched with use
and age, measurements made with it would be smaller than those made 
with an accurate tape, and there would be a systematic negative error. 
The error would be negative because it is customary to measure error 
as “the obtained measure minus the correct measure.” The terms 
random, chance, or unsystematic errors on the other hand refer to discrepancies
that are sometimes large and sometimes small, sometimes
positive and sometimes negative. 

The basic assumptions of test theory deal with the definition and the 
estimation of chance errors. These “random errors” are the only errors 
that will be explicitly considered in test theory. For many purposes, 
constant errors can be ignored, since the process of establishing test 
norms takes care of constant errors that may appear in the gross score 
on the test. 

Since we are dealing with random errors, it is not unreasonable to 
assume that over a sufficiently large number of cases the average error 
of this type will be zero. We may write this assumption,

(2) $M_e = 0$,

and note that the larger the number of cases in the distribution, the 
closer will this assumption be approximated. This equation may also 
be regarded, not as an assumption, but as a part of the definition of 
random errors. By random errors we mean errors that average to zero 
over a large number of cases. Stating more exactly, we can say that the 
mean error will differ from zero by an amount that will be smaller than 
any assigned quantity however small, if the number of cases is sufficiently
large. In actual practice, however, it is customary to assume that equation
2 holds exactly for any particular sample that is being considered.

Turning to a consideration of the relationship between error score
and true score, we can see that there is no reason to expect positive 
errors to occur oftener with high than with low true scores, and that the 
same holds for negative errors. Likewise there is no reason to expect 
large errors to occur oftener with low than with high true scores. It is 
reasonable to assume that as the number of cases increases the correlation
between true and error scores approaches zero. We may write the
equation

(3) $r_{te} = 0$

and note that it comes closer and closer to being correct as the number 
of cases increases. Like equation 2, equation 3 is not so much an assumption
as a definition. If the errors correlate with true score, they are not
random errors. In such a case there is a systematic tendency for persons
with high scores (or low scores) to have the larger errors. In practice
it is assumed in testing work that equation 3 holds for any given set
of test data. 

The only other equation needed to define random error relates to the 
correlation between error on one test and error on another parallel test. 
As before, we can point out that there is no reason to expect a relationship,
and that if a relationship existed between error scores on one test
and error scores on a second test, we should have some systematic and 
predictable source of error and not a random error. In other words, by 
definition the correlation between two sets of random errors is zero or 
approaches zero as the number of cases increases. We may use the 
subscripts 1 and 2 to represent any two parallel tests and write 

(4) $r_{e_1 e_2} = 0$.

Again, as before, we note that strictly speaking this assumption is true 
only as the number of cases approaches infinity. In practice, however, 
it is assumed to hold for any given set of test data. 

We may summarize the foregoing material in the following three 
definitions of random error: 

The mean error is zero (equation 2). The correlation between
error score and true score is zero (equation 3). The
correlation between errors on one test and those on another
parallel test is zero (equation 4).

We should finally note again that actually these definitions do not hold
unless the number of cases is very large, but that in practice it is customary
to assume that they hold for any given set of test data.
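
The three definitions can be illustrated numerically. The Python sketch below is not part of the original text. It assumes normally distributed true scores and errors; the means, standard deviations, sample size, and seed are arbitrary choices made only for illustration. With a large number of cases, the sample mean error and the two correlations should come out close to zero, though not exactly zero.

```python
import random
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(a, b):
    """Product-moment correlation between two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)


rng = random.Random(0)   # fixed seed so the run is repeatable
N = 20000                # assumed number of cases

# Hypothetical true scores and two independent sets of random errors.
true_scores = [rng.gauss(50, 10) for _ in range(N)]
errors_1 = [rng.gauss(0, 5) for _ in range(N)]
errors_2 = [rng.gauss(0, 5) for _ in range(N)]

mean_error = mean(errors_1)               # equation 2: should be near 0
r_te = pearson_r(true_scores, errors_1)   # equation 3: should be near 0
r_e1e2 = pearson_r(errors_1, errors_2)    # equation 4: should be near 0

print("mean error:", mean_error)
print("r(true, error):", r_te)
print("r(error 1, error 2):", r_e1e2)
```

Rerunning with a smaller N shows the three quantities drifting further from zero, which is the sense in which the definitions hold only as the number of cases becomes large.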

5. Determination of mean true score 

In order to determine the mean true score we note from the definition 
of equation 1 that 

(5) $T_i = X_i - E_i$.

Summing both sides gives

(6) $\sum_{i=1}^{N} T_i = \sum_{i=1}^{N} (X_i - E_i)$.

Removing the parentheses and omitting the subscripts and limits (since
they are all identical), we have

(7) $\sum T = \sum X - \sum E$.

Dividing by N (the number of cases) to obtain the mean gives

(8) $M_t = M_x - M_e$.

Using equation 2, we can see that

(9) $M_t = M_x$.

The mean true score equals the mean observed score. Equation
9 is based only on the definition of equation 1 and the
assumption of equation 2.

6. Relationship between true and error variance 

Next let us determine the relationship between the standard deviations
of the true, the error, and the observed scores. From the definition
of equation 1 and from equation 9 we may write

(10) $X - M_x = T + E - M_t$.

Let us use lower-case letters to represent deviation scores. That is,

(11) $x = X - M_x$,

(12) $t = T - M_t$,

and since $M_e$ equals zero from the definition of equation 2, we have

(13) $e = E$.

Substituting equations 11, 12, and 13 in equation 10 gives

(14) $x = t + e$.

Squaring and summing gives

(15) $\sum x^2 = \sum (t + e)^2$.

Removing the parentheses gives

$\sum x^2 = \sum t^2 + \sum e^2 + 2 \sum te$.

Dividing both sides by N, we have

(16) $s_x^2 = s_t^2 + s_e^2 + 2 r_{te} s_t s_e$.

Substituting the definition of equation 3 in equation 16 gives

(17) $s_x^2 = s_t^2 + s_e^2$.

The variance of the observed scores is equal to the sum of the
true variance and error variance. It should be noted that
equation 17 may be derived solely from the definition of equation
1 and the assumption of equation 3.
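
A small numerical check of equations 9, 16, and 17 may be helpful. The Python sketch below is an illustration only; the five true scores and errors are hypothetical values chosen so that the mean error and the covariance between true and error scores are exactly zero. Variances are computed by dividing by N, as in the derivation above.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Variance computed by dividing by N (not N - 1), as in the text."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def covariance(a, b):
    """Mean product of deviation scores; equals r * s_a * s_b."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


# Hypothetical data: errors sum to zero and are uncorrelated with true scores.
true_scores = [8.0, 9.0, 10.0, 11.0, 12.0]
errors = [1.0, -2.0, 2.0, -2.0, 1.0]
observed = [t + e for t, e in zip(true_scores, errors)]   # equation 1

var_x = variance(observed)
var_t = variance(true_scores)
var_e = variance(errors)
cov_te = covariance(true_scores, errors)   # the term r_te * s_t * s_e

print("mean true, mean observed:", mean(true_scores), mean(observed))  # equation 9
print("equation 16:", var_x, "vs", var_t + var_e + 2 * cov_te)
print("equation 17:", var_x, "vs", var_t + var_e)
```

Equation 16 is an identity and holds for any data whatever; equation 17 holds here only because the covariance term was made zero by construction.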

The relationship between these three variances is shown in Figure 1.
Such a diagram can readily be constructed to cover any particular set
of values for the three variances. The true variance and the error
variance are indicated, one on the abscissa, the other on the ordinate.
The set of diagonal lines indicates points such that the sum of the true
and the error variance is constant; hence each diagonal line can be
marked with the appropriate value of $s_x^2$.

Figure 1. Computing diagram for observed, true, and error variance ($s_x^2 = s_t^2 + s_e^2$).



Figure 2. Computing diagram for observed, true, and error standard deviations ($s_x = \sqrt{s_t^2 + s_e^2}$).


The computing diagram of Figure 1 is the basic one for addition of
two quantities. The x and y scales should be set up to cover the appropriate
range of values. If the two scales are the same, a set of 45-degree
lines can be marked to indicate the sum of the x and y values. If it is
not feasible to use the same units on the x and y scales, the slope of the
"sum" lines must be different from 45 degrees.

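The construction of the "sum" lines can be sketched in a few lines of Python. This is an illustration rather than a procedure given in the text: the function names and the idea of describing each axis by its number of data units per inch are assumptions introduced here. Each line of constant sum needs only its two end points, and its drawn angle follows from the ratio of the two scales.

```python
from math import atan, degrees


def sum_line_endpoints(total):
    """End points, in data units, of the straight line x + y = total."""
    return (total, 0.0), (0.0, total)


def sum_line_angle(x_units_per_inch, y_units_per_inch):
    """Angle in degrees, measured below the horizontal, of a drawn sum line.

    On paper a point (x, y) is plotted at (x / x_units_per_inch,
    y / y_units_per_inch) inches, so the drawn slope of x + y = total
    is -(x_units_per_inch / y_units_per_inch).
    """
    return degrees(atan(x_units_per_inch / y_units_per_inch))


# Two points per line are enough to rule each diagonal.
for total in (2.0, 4.0, 6.0):
    print("sum =", total, "end points:", sum_line_endpoints(total))

print("equal scales:", sum_line_angle(1.0, 1.0), "degrees")
print("x axis compressed 2 to 1:", sum_line_angle(2.0, 1.0), "degrees")
```

With equal scales the angle is 45 degrees; with unequal scales it departs from 45 degrees, as the paragraph above states.
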
The error variance ($s_e^2$) or the error of measurement ($s_e$) is a fundamental
and important characteristic of any test. If the error of measurement
can be reduced, the test has been improved. If any factors
operate to increase the error of measurement, the test has been made
poorer. Much of the effort in test construction, test revision, test
analysis, and the precautions of test administration are for the purpose
of decreasing the value of $s_e$.

Taking the square root of both sides of equation 17 gives 

(18) $s_x = \sqrt{s_t^2 + s_e^2}$.

This is the familiar Pythagorean theorem: “The hypotenuse of a right 
triangle is equal to the square root of the sum of the squares of the two 
sides.” $s_t$ and $s_e$ could be diagramed as the sides of a right triangle, and
$s_x$ would be the hypotenuse.

This relationship can be utilized to construct a simple computing 
diagram for the foregoing equation. Draw a series of concentric quarter 
circles, as illustrated in Figure 2, including the range of values of $s_t$ and
$s_e$ likely to be found in the data being considered. Find $s_t$ and $s_e$ on
the vertical and horizontal axes. From the point of intersection follow
the circle to either axis and read off $s_x$. For example, the dotted lines
show that if $s_t = 4$ and $s_e = 3$, then $s_x = 5$.
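
The reading taken from such a diagram can also be computed directly from equation 18. The short Python sketch below is an illustration; the function names are hypothetical. It reproduces the example of a true standard deviation of 4 and an error standard deviation of 3, and also solves the same relation for the error standard deviation.

```python
from math import hypot, sqrt


def observed_sd(true_sd, error_sd):
    """Equation 18: the observed standard deviation is the hypotenuse."""
    return hypot(true_sd, error_sd)


def error_sd(observed_sd_value, true_sd):
    """Equation 18 solved for the error standard deviation."""
    return sqrt(observed_sd_value ** 2 - true_sd ** 2)


print(observed_sd(4.0, 3.0))   # true sd 4, error sd 3
print(error_sd(5.0, 4.0))      # observed sd 5, true sd 4
```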

7. Definition of parallel tests in terms of true score and 
error variance 

From a common sense point of view, it may be said that two tests 
are “parallel” when “it makes no difference which test you use.” It is 
certainly clear that, if for some reason one test is better than the other 
for certain purposes, it does make a difference which test is used, and 
the tests could not be termed parallel. However, this simple statement, 
“it makes no difference which test is used,” must be cast in mathematical
form before we can use it in any derivations.

It would seem to be clear that if the true score of a given person (i)
on one test is different from his true score on a second test, we cannot
say that the two tests are parallel. In other words, if we designate the
person by the subscript i and the two tests by the subscripts g and h,
we can say that two tests are parallel only if

(19) $T_{ig} = T_{ih}$.

The true score of any person on one test must equal the true 
score of that person on the other parallel test. 

Equation 19, however, is not the only requirement for parallel tests.
If the difference between observed score and true score is in general
much greater for one test, it is clearly better to use the other test. And
we cannot say, “It makes no difference which test is used.” Also it is
unreasonable, in the light of what has been said previously, to require
that E_{ig} = E_{ih}. This statement would contradict the definition of
equation 4, which says that the errors on one test correlate zero with
the errors on another test. If for each person (i) the error score on one
test equaled the error score on another test, the correlation between
errors would be 1.00 and not zero. The closest we can reasonably come
to defining the errors to be alike on parallel tests is to require that the
standard deviation of the errors on one test equal the standard deviation
of errors on the other test. Since this can be true when the correlation
is zero, it does not violate any other assumption that has been
made. Thus, we may write the second equation defining parallel tests,

(20)    s_{e_g} = s_{e_h}.

This equation may be stated in the definition:

For two parallel tests, the errors of measurement are equal. 

Equations 19 and 20 will serve to define parallel tests. 

It will be noted that both equations 19 and 20 are stated in terms of 
hypothetical quantities. In other words, these equations do not pro¬ 
vide for testing the actual means, standard deviations, or intercorrela¬ 
tions of a set of tests and for determining whether or not they are paral¬ 
lel tests. Let us see what can be determined from equations 19 and 20 
regarding the parameters of the observed scores. 

If all true scores are identical, it follows that the means and standard 
deviations of true scores are also identical for parallel tests. We may 
write 

(21)    M_{T_g} = M_{T_h}    or    \bar{T}_g = \bar{T}_h
and

(22)    s_{T_g} = s_{T_h}.

The correlation between two sets of identical scores is unity; therefore,

(23)    r_{T_g T_h} = 1.00.

Applying equation 17 to tests g and h, we see that

(24)    s_{X_g}^2 = s_{T_g}^2 + s_{e_g}^2,
and correspondingly for test h

(25)    s_{X_h}^2 = s_{T_h}^2 + s_{e_h}^2.

From equations 20 and 22 we see that

(26)    s_{X_g}^2 = s_{X_h}^2.

From equations 21 and 9 it follows that

(27)    M_{X_g} = M_{X_h}    or    \bar{X}_g = \bar{X}_h.

The means and standard deviations of parallel tests are equal. 1

We turn next to consideration of the problem of the correlation 
between parallel forms of a test. 


8. Correlation between parallel forms of a test 
For the present we shall define reliability as the correlation between 
two parallel forms of a test. The problem of determining whether or
not two forms are parallel and of the best method of estimating the 
correlation between the two forms will be considered later. 

The correlation between two parallel forms of a test may readily be 
found by using the deviation score formula for correlation: 


(28)    r_{x_g x_h} = \frac{\sum x_g x_h}{N s_g s_h}.


From equation 14 we may express the numerator of the right side of 
equation 28 as follows: 


(29)    \sum x_g x_h = \sum (t_g + e_g)(t_h + e_h).


Expanding and removing the parentheses gives


(30)    \sum x_g x_h = \sum t_g t_h + \sum t_g e_h + \sum t_h e_g + \sum e_g e_h.


From the definitions of equations 3 and 4 we see that the last three 
terms in equation 30 are each zero. Since we are dealing with parallel 
tests, the true score on g equals the true score on h. Therefore, equation 
30 may be rewritten as 

(31)    \sum x_g x_h = \sum t_g^2 = \sum t_h^2.


We may divide both sides of equation 31 by N, obtaining


(32)    \frac{\sum x_g x_h}{N} = \frac{\sum t_g^2}{N}.


1 It should be noted that equations 26 and 27 may be interpreted as applying
either to gross scores or to scores after transformation by one of the methods
discussed in Chapter 19. However, the set of scores (X_g) and the set of scores
(X_h) are not parallel unless the means and standard deviations are about equal.
Criteria of equality are discussed in Chapter 14.
From the definition of a standard deviation we see that

(33)    \frac{\sum t_g^2}{N} = s_t^2.

Substituting equations 26 and 33 in equation 28 gives

(34)    r_{x_g x_h} = \frac{s_t^2}{s_x^2}.

We see from the reasoning used in developing equation 26 that, if we 
were dealing with several parallel forms, all the observed variances 
would be alike. That is, if we designate the forms by subscripts 1, 2, 3, 
and 4, 


(35)    s_{x_1} = s_{x_2} = s_{x_3} = s_{x_4}.

Also, by the reasoning used in developing equation 22, we see that


(36)    s_{t_1} = s_{t_2} = s_{t_3} = s_{t_4}.

From equations 34, 35, and 36 we see that

(37)    r_{x_1 x_2} = r_{x_1 x_3} = r_{x_2 x_3} = \cdots = r_{x_3 x_4}.

All intercorrelations of parallel tests are equal.

Equations 26, 27, and 37 show that parallel forms of a test should 
have approximately equal means, equal standard deviations, and equal 
intercorrelations. 1 These are objective quantitative criteria for which 
a statistical test will be presented in Chapter 14. In addition to satisfy¬ 
ing these objective and quantitative criteria, parallel tests should also 
be similar with respect to test content, item types, instructions to 
students, etc. Similarity in these respects can as yet be determined 
only by the judgment of psychologists and of subject matter experts. 


9. Equation for true variance 

Multiplying equation 34 by s_x^2 gives

(38)    s_t^2 = s_x^2 r_{x_g x_h}    (true variance).


Taking the square root of both sides, we have

(39)    s_t = s_x \sqrt{r_{x_g x_h}}    (true standard deviation).


1 Means and standard deviations may be equal for the raw scores or for scores 
transformed by one of the methods suggested in Chapter 19. For transformed 
scores, it is necessary to determine the transformation equation from one set of data 
and to test for equality of means and standard deviations after the same transforma¬ 
tion has been applied to a new set of data. 
Equations 38 and 39 give the variance and standard deviation
of the distribution of true scores in terms of the test reliability
and the standard deviation of the distribution of observed
scores. These equations may be derived by assuming
only equations 1, 3, 30, and 36.

10. Equation for error variance (the error of measurement) 

We may solve for the error variance by substituting equation 38 in 
equation 17, obtaining 

(40)    s_x^2 = s_x^2 r_{x_g x_h} + s_e^2.

Solving equation 40 for the error variance (s_e^2) gives

(41)    s_e^2 = s_x^2 (1 - r_{x_g x_h})

(variance of the errors of measurement).

Taking the square root of both sides gives the equation for the standard
error of measurement,

(42)    s_e = s_x \sqrt{1 - r_{x_g x_h}}    (error of measurement).

Equations 41 and 42 give the variance of the errors of measurement
and the standard deviation of these errors. They
follow from the assumptions needed to derive equations 38 
and 39. The error of measurement is a fundamental con¬ 
cept in test theory and an important characteristic of a test. 

It was suggested by Kelley (1921) and Otis (1922b) that the error of
measurement was an invariant of a test, that is, it did not vary with 
changes in group heterogeneity. The equations given in Chapter 10 
are based on this assumption. 

The computing diagram used before to indicate the relationship
s_x = \sqrt{s_t^2 + s_e^2} can also be used, with some slight complication, to
compute the true standard deviation and error of measurement, given
the standard deviation and reliability of the test. Figure 3 indicates
how this computation is done. Draw the series of circles, as before, to
indicate the relationship between true, error, and observed standard
deviation. Radial lines can then be drawn to indicate a given reliability.
Where each of the horizontal lines (indicating s_t) intersects the large
quadrant with radius 10, we have successively the points for which the
reliability is .1^2, .2^2, ..., .9^2. Still finer subdivisions can be drawn from
the general rule that the reliability coefficient is the ratio of the true
variance to the observed variance. By selecting the quadrant with
radius 10, the division consists simply in pointing off two decimal places
(standard deviation 10 gives variance of 100). For each point along
this quadrant, the reliability coefficient is s_t^2/100.

In order to use the diagram, find the radius corresponding to the
reliability coefficient and the circle corresponding to the observed
standard deviation. Then note the intersection of this particular radius
and circle. The y-coordinate of that point is the true standard deviation;
the x-coordinate of that point is the error of measurement for
that test.

Figure 3. The true standard deviation and error of measurement as functions of
the test reliability and standard deviation. (Abac for s_t = s_x \sqrt{r} and
s_e = s_x \sqrt{1 - r}.)

For example, the point marked with a circle is where the reliability
is .64, and the standard deviation of observed scores is 7. In this
case the standard deviation of the true scores is 5.6 (7 × \sqrt{.64}), and
the error of measurement is the x-coordinate of the same point, 4.2
(7 × \sqrt{.36}).
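
The same example can be worked directly from equations 39 and 42 instead of being read from the diagram. This is a minimal sketch; the function names `true_sd` and `error_of_measurement` are hypothetical and chosen only for illustration.

```python
import math


def true_sd(s_x: float, reliability: float) -> float:
    """Equation 39: s_t = s_x * sqrt(r)."""
    return s_x * math.sqrt(reliability)


def error_of_measurement(s_x: float, reliability: float) -> float:
    """Equation 42: s_e = s_x * sqrt(1 - r)."""
    return s_x * math.sqrt(1.0 - reliability)


# The point marked in Figure 3: reliability .64, observed standard deviation 7.
s_t = true_sd(7.0, 0.64)
s_e = error_of_measurement(7.0, 0.64)
print(round(s_t, 2), round(s_e, 2))  # 5.6 4.2
```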

If we know the true and error standard deviations, it is possible to
use the same diagram to read off the reliability and the observed standard
deviation. In this case, look up the true standard deviation on the
y-axis, the error of measurement on the x-axis, find the point with these
two coordinates, and then read the value of the radius through this
point to find the test reliability, and the value of the circle through this
point to find the standard deviation of observed scores.

In general we may describe this graph by pointing out that it is a 
combination of rectangular and polar coordinate systems. Any point 
in it represents four quantities, one on the x-axis, one on the y-axis (the 
rectangular coordinate system), one on the radius, and one on the 
circle (the polar coordinate system). Each point, which represents 
four quantities, is determined by any two of these quantities. There¬ 
fore, generalizing, we may say that given any two of the four quantities 
involved, it is possible to determine the other two. 
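
The reverse reading of the diagram, from rectangular to polar coordinates, can be sketched in the same way. The function name `reliability_and_observed_sd` is hypothetical; the two formulas used are the ratio of true to observed variance (equation 34) and the sum of true and error variance (equation 17).

```python
import math


def reliability_and_observed_sd(s_t: float, s_e: float):
    """Given the true and error standard deviations, return the
    reliability (true variance / observed variance) and the
    observed standard deviation."""
    observed_variance = s_t ** 2 + s_e ** 2
    reliability = s_t ** 2 / observed_variance
    return reliability, math.sqrt(observed_variance)


# Reversing the earlier example: s_t = 5.6 and s_e = 4.2.
r, s_x = reliability_and_observed_sd(5.6, 4.2)
print(round(r, 2), round(s_x, 2))  # 0.64 7.0
```

Any two of the four quantities fix the other two, so similar helper functions could be written for each remaining pair.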

11. Use of the error of measurement 

It should be noted that, in the development of the equations of this 
chapter, there is no reference to any particular type of frequency dis¬ 
tribution. It is assumed that the average error is zero, and that errors 
are uncorrelated with each other or with true score. All the equations 
of this chapter follow from these assumptions, regardless of the fre¬ 
quency distribution of true scores, error scores, or observed scores. 
However, it is not possible to make use of the error of measurement 
without some assumption about its distribution. This quantity is 
usually assumed to be normally distributed. Figure 4 gives a concrete 
illustration for a particular test in which the error of measurement is 
assumed to be 5 score points. Observed scores are indicated on the 
base line. A is the distribution of observed scores for all persons whose 
true score is 50 points. It will be noted that the mean is at 50, the in¬ 
flection points at 45 and 55 (50 plus and minus 5), and that all but a 
negligible part of the distribution lies between scores of 35 and 65 (50 
plus and minus three times 5). That is to say, for the group of persons 
whose true score is 50, over 99 per cent of the observed scores will lie 
between 35 and 65. 
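
The figure of "over 99 per cent" follows from the assumed normal distribution of errors. The sketch below computes the proportion of observed scores lying within c times the error of measurement of a given true score; the function names `prob_within` and `observed_score_range` are hypothetical, and the normality of errors is the assumption stated in the text, not a consequence of the equations of this chapter.

```python
import math


def prob_within(c: float) -> float:
    """Proportion of a normal distribution lying within c standard
    deviations of its mean."""
    return math.erf(c / math.sqrt(2.0))


def observed_score_range(true_score: float, s_e: float, c: float = 3.0):
    """Range of observed scores within c errors of measurement of a
    given true score."""
    return true_score - c * s_e, true_score + c * s_e


print(observed_score_range(50.0, 5.0))  # (35.0, 65.0)
print(round(prob_within(3.0), 4))       # about 0.9973
```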

We may also indicate a method for assigning a probability value
to the statement: “If a person's observed score is 65, his true score
probably lies between 50 and 80.” Note that no probability value is
given. We cannot say that for all persons whose observed score is
65, the probability is greater than .99 that the true score is between 50
and 80. However, consider the statement: “This person's true score
lies between 50 and 80.” If the person's true score were known, the
statement could, for a given person, be classified as true or as false. It
would be possible to keep on following this rule of procedure. For
each person (some with one observed score, and some with another),
we could say: “This person's true score lies between T_L and T_U,” where
T_L, the lower limit for the true score, is found by subtracting three
times the error of measurement from the observed score, and T_U, the
upper limit for the true score, is found by adding three times the error
of measurement to the observed score. For each of the persons whose
observed score is known, such a statement can be made. In addition,
the statement so made can be labeled true or false. For all the persons
in the distribution, it will be found that the statement regarding limits
is true over 99 per cent of the time and is false less than three times out
of a thousand. In other words, if all the cases are considered, a probability
can be attached to the truth or falsity of the statement that
“true score is included within the specified limits.” However, if we
limit the statement to persons with any given observed score, no assertion
regarding probability can be made.

Figure 4. Illustration of a significant difference between two test scores.
Upper part: A = distribution of observed scores for those with a true score
of 50; B = distribution of observed scores for those with a true score of 80
(standard error of measurement is 5.0 for both distributions). Lower part:
scatter plot of true score against observed score. Horizontal axis in both
parts: observed score, 10 to 110.

The situation shown in the upper part of the diagram for persons 
with a true score of 50 and of 80 can be generalized if the correlation 
scatter plot is given. The lower part of the diagram shows the scatter 
plot of true score against observed score. The heavy line marks the 
cases in which observed score and true score are equal. Two lighter 
lines are drawn on each side of the heavy one to indicate those cases in 
which the observed score is equal to the true score minus three times
the error of measurement, and in which the observed score is equal to
the true score plus three times the error of measurement. For true
scores of 50 and 80, again we can read the limits within which over
99 per cent of the observed scores will fall. We can also see that the
observed score of 65 could reasonably be from any distribution where
the true score was over 50 and under 80. Likewise, we may pick any
observed score, such as 60 in the diagram, and see that such a score 
might have arisen from any true score over 45 and under 75. That is 
to say, take three times the error of measurement, or 15; then add 15 
to 60 and subtract 15 from 60, obtaining the upper and lower limits of 
75 and 45. Thus if we know the error of measurement of a test it is 
possible, for any given observed score, to specify limits within which 
the true score lies. We can also say that such a statement is true for 
99.7 per cent of the cases (and false for 0.3 per cent of the cases) in the 
entire distribution. However, no such probability statement can be
made which applies to the group of persons making any specified observed
score.

Distribution B shows the same information for persons
whose true score is 80. Since it is assumed that the standard error of
measurement is constant regardless of true score, distribution B also
has a standard deviation of 5 points. For persons whose true score is
80, the observed scores, in over 99 per cent of the cases, will lie between
65 and 95 (80 plus and minus three times 5). It will be noted that if a
person’s observed score is 65, he might reasonably be either a top-scoring
person from distribution A, or a very low-scoring person from
distribution B. In other words, if a person’s observed score is 65, his
true score could reasonably be as low as 50 (65 minus three times 5), or
as high as 80 (65 plus three times 5). His true score might also reasonably
have been any value between 50 and 80. However, if a person’s
true score is below 50, there is considerably less than two chances in a
thousand that his observed score would be 65 or higher. Conversely,
if a person’s true score is above 80, there is considerably less than two
chances in a thousand that his observed score would be 65 or less. We
say then that for a person whose observed score is 65, reasonable limits
for his true score are 50 to 80.

It should be noted that, although we can assign reasonable limits for 
the true score, we cannot say that for all persons who score 65, over 99 
per cent will have true scores between 50 and 80. In general no prob¬ 
ability statements can be made to apply to all persons who make a 
given observed score. We can only make the statement the other way 
around. For all persons with a given true score, the probability is over 
.997 that the observed score will lie within plus or minus three times 
the error of measurement from that true score. Likewise, for all per¬ 
sons with a given true score, the probability is less than .003 that the 
observed score will lie outside the range given by the true score plus 
and minus three times the error of measurement. 

For persons with any given observed score X_i, reasonable
limits for the true score T_i may be taken as

(43)    X_i + c s_e > T_i > X_i - c s_e,

where c is taken as equal to 2 or 3 and s_e = s_x \sqrt{1 - r_{x_g x_h}}.
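
Equation 43 can be sketched as a small function. The name `true_score_limits` is hypothetical, and the observed standard deviation of 10 and reliability of .75 used in the example are assumed values, chosen only because they give the error of measurement of 5 used in Figure 4.

```python
import math


def true_score_limits(x_i: float, s_x: float, reliability: float, c: float = 3.0):
    """Reasonable limits for the true score (equation 43):
    X_i - c*s_e < T_i < X_i + c*s_e, with s_e = s_x * sqrt(1 - r)."""
    s_e = s_x * math.sqrt(1.0 - reliability)
    return x_i - c * s_e, x_i + c * s_e


# Assumed values: s_x = 10 and r = .75 give s_e = 5, as in Figure 4.
lower, upper = true_score_limits(65.0, 10.0, 0.75)
print(lower, upper)  # 50.0 80.0
```

Taking c = 2 instead of 3 narrows the limits at the cost of a larger proportion of false statements over the whole distribution.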

The error of measurement may also be utilized to determine reason¬ 
able limits for the difference between the true scores of two persons. In 
this case we utilize the difference between two scores, x_i - x_j, and the
standard error of that difference, s_{e_i - e_j}. To write the formula for this
error, we use equation 14 and write

(44)    x_i - x_j = t_i - t_j + (e_i - e_j).

The terms in parentheses indicate the error. Thus the variation of the
observed difference from the true difference is indicated by

(45)    \sum (e_i - e_j)^2 = \sum e_i^2 + \sum e_j^2 - 2 \sum e_i e_j.

From equation 4, we see that the last term of equation 45 is zero. Using
equation 41 in equation 45, we have

(46)    \sum (e_i - e_j)^2 = 2 N s_x^2 (1 - r_{x_g x_h}).

Dividing by N and taking the square root, to get the standard error of
difference s_{e_i - e_j}, we have

(47)    s_{e_i - e_j} = s_x \sqrt{2} \sqrt{1 - r_{x_g x_h}}.
Illustrating again, using the distribution of Figure 4, the standard
error of which is 5, we have the standard error of the difference of two
scores as 5\sqrt{2}. Figure 5 illustrates the frequency distribution of observed
score differences for persons with true score differences of -25, 0,
and +25. In each case the distribution is shown as extending 3 × 5\sqrt{2}
above and below the true score difference.



Figure 5. Illustrating the standard error of a difference of observed scores.
(Horizontal axis: observed score, -60 to +60.)


Again it should be noted that in developing the equation for the 
standard error of a difference, no assumptions were made regarding the 
distribution of errors. However, in order to utilize this error to obtain 
reasonable limits for the value of the difference between true scores, 
some assumption regarding the frequency distribution of errors is needed. 
In Figure 5 the usual assumption of a normal distribution of errors is 
made. 

As before, we may generalize for all possible distributions as shown in
the lower part of Figure 5. It shows that if the observed score difference
is zero, the true score difference may be as high as +3 × 5\sqrt{2} or
as low as -3 × 5\sqrt{2}. If the observed score difference is as large as 22,
the entire range from 22 + 3 × 5\sqrt{2} to 22 - 3 × 5\sqrt{2} does not
include zero. Hence it is not reasonable to assume that there is no difference
between the two true scores represented by these observed
scores. In such a case it has become customary to say that there is a
significant difference between the two scores. When the difference
between two scores is less than 3 × 5\sqrt{2}, the range of possible true
score differences will include zero; and it is conventional to say that
there is not a significant difference between the two scores. This means
that zero difference is one of the possibilities.

For persons with any given observed score difference (x_i - x_j),
reasonable limits for the difference of true scores (t_i - t_j) may
be taken as

(48)    x_i - x_j + c s_e \sqrt{2} > t_i - t_j > x_i - x_j - c s_e \sqrt{2},

where c is taken as equal to 2 or 3, and s_e = s_x \sqrt{1 - r_{x_g x_h}}.

If these limits include zero, there is no significant difference
between x_i and x_j. If the limits are both positive, or both
negative, there is a significant difference between x_i and x_j.
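
The rule of equation 48 can be sketched as follows. The function names `difference_limits` and `is_significant` are hypothetical; the error of measurement of 5 is the value used in Figures 4 and 5, and the individual scores of 72, 70, and 50 are assumed values chosen to give differences of 22 and 20.

```python
import math


def difference_limits(x_i: float, x_j: float, s_e: float, c: float = 3.0):
    """Reasonable limits for the true score difference (equation 48)."""
    half_width = c * s_e * math.sqrt(2.0)
    diff = x_i - x_j
    return diff - half_width, diff + half_width


def is_significant(x_i: float, x_j: float, s_e: float, c: float = 3.0) -> bool:
    """True when the limits are both positive or both negative,
    that is, when they do not include zero."""
    lower, upper = difference_limits(x_i, x_j, s_e, c)
    return lower > 0 or upper < 0


# With s_e = 5, three standard errors of a difference is about 21.2.
print(is_significant(72.0, 50.0, 5.0))  # True: a difference of 22
print(is_significant(70.0, 50.0, 5.0))  # False: a difference of 20
```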

It should be noted that we are discussing the comparison of single 
cases. For us to be certain that Mr. A’s score is different from Mr. B’s 
score there must be a very large difference between the two scores. 
However, when we are setting up a selection policy that is to be used 
on several hundred cases, it is legitimate, for example, to accept every¬ 
one with a score of 76, or higher, and reject everyone with a score of 
75 or lower. The average true score of a hundred persons scoring 76 
will be higher than the average true score of a hundred persons scoring 
75, so that in the long run better persons will be accepted and poorer 
ones rejected. However, the magnitude of the error that may be made 
in a single case or the percentage of errors that will be made in a large 
number of cases is indicated by the error of measurement. 


12. Correlation of true and observed scores (index of reliability) 
In order to obtain the correlation between true and observed scores 
we begin with the basic equation for correlation, writing the correlation 
of observed and true scores as 


(49)    r_{xt} = \frac{\sum xt}{N s_x s_t}.


Substituting equation 14 in equation 49 gives


(50)    r_{xt} = \frac{\sum (t + e) t}{N s_x s_t}.


Removing parentheses, we have


(51)    r_{xt} = \frac{\sum t^2 + \sum te}{N s_x s_t}.


Dividing each of the terms in the numerator by the N in the denominator
gives


(52)    r_{xt} = \frac{s_t^2 + r_{te} s_t s_e}{s_x s_t}.


Since the correlation between deviation scores is identical with the 
correlation between gross scores, we can see from equation 3 that the 
second term of the numerator in 52 vanishes. Dividing both numerator 
and denominator by s t gives 

(53)    r_{xt} = \frac{s_t}{s_x}.


Substituting equation 39 in equation 53 and canceling s_x from numerator
and denominator gives

(54)    r_{xt} = \sqrt{r_{x_g x_h}}    (index of reliability).

The foregoing formula was given by Kelley (1916).

The correlation between observed scores and true scores as
given by equation 54 is known as the index of reliability.

The test validity must always be less than the index of reli¬ 
ability. 
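
The index of reliability can be illustrated by simulation, since in artificial data the true scores are known. The sketch below is illustrative only: the normal distributions, the sample size, the seed, and the helper name `pearson` are all assumptions introduced here, and the derivation itself requires none of them.

```python
import math
import random


def pearson(a, b):
    """Product-moment correlation computed from deviation scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cross / (sd_a * sd_b)


rng = random.Random(1)
n, s_t, s_e = 20000, 4.0, 3.0
true_scores = [rng.gauss(0.0, s_t) for _ in range(n)]
observed = [t + rng.gauss(0.0, s_e) for t in true_scores]

reliability = s_t ** 2 / (s_t ** 2 + s_e ** 2)  # 0.64 by equation 34
index = math.sqrt(reliability)                  # equation 54
r_xt = pearson(observed, true_scores)
print(round(r_xt, 3), round(index, 3))  # the two values should be close
```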


13. Correlation of observed scores and error scores 


Just as in considering the correlation between observed and true 
scores we begin here with the basic deviation score formula for correla¬ 
tion, 


(55)    r_{xe} = \frac{\sum xe}{N s_x s_e}.


As before, substituting equation 14 in equation 55 gives


(56)    r_{xe} = \frac{\sum (t + e) e}{N s_x s_e}.


Following the same procedure as in the preceding section, we remove
parentheses, obtaining

(57)    r_{xe} = \frac{\sum te + \sum e^2}{N s_x s_e}.

Figure 6. Relationship between test reliability, index of reliability, and correlation
of observed and error scores.


Dividing each of the terms in the numerator by the N in the denomina¬ 
tor gives 

(58)    r_{xe} = \frac{r_{te} s_t s_e + s_e^2}{s_x s_e}.

Again from equation 3 we see that the first term in the numerator
vanishes, and we may then divide numerator and denominator by s_e,
obtaining

(59)    r_{xe} = \frac{s_e}{s_x}.

Substituting equation 42 in equation 59 and dividing numerator and
denominator by s_x gives

(60)    r_{xe} = \sqrt{1 - r_{x_g x_h}}.
Equation 60 gives the correlation between observed scores and
error scores as a function of the reliability coefficient. From
equation 42, we see that r_{xe} is equal to the standard error of
measurement when the test standard deviation is taken as
unity. In Chapter 19 it is shown that the standard deviation
of standard scores is equal to unity. Thus equation 60
gives the error of measurement for a standard score.

The relationship between the reliability coefficient, the index of
reliability, and the quantity \sqrt{1 - r_{x_g x_h}} is shown in Figure 6, which is
similar to Figure 3 showing the relationship between reliability, standard
deviation, and error of measurement.

The x - and y- axes are scaled by tenths, or finer, from zero to one. 
One of these axes gives the index of reliability, and the other the cor¬ 
relation between observed and error scores. A quadrant is drawn with 
radius unity, and is scaled in terms of the reliability coefficient. Any 
point on this quadrant then represents a given reliability, index of re¬ 
liability, and correlation between observed and error scores. With 
any one of the values given, the other two can be read from the graph. 
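
The reason a single quadrant of unit radius can carry all three quantities is that the squares of the two correlations sum to unity. A minimal sketch follows; the function names `index_of_reliability` and `corr_observed_error` and the three reliabilities in the loop are hypothetical choices for illustration.

```python
import math


def index_of_reliability(reliability: float) -> float:
    """Equation 54: correlation of observed and true scores."""
    return math.sqrt(reliability)


def corr_observed_error(reliability: float) -> float:
    """Equation 60: correlation of observed and error scores."""
    return math.sqrt(1.0 - reliability)


for r in (0.36, 0.64, 0.91):
    r_xt = index_of_reliability(r)
    r_xe = corr_observed_error(r)
    # The point (r_xe, r_xt) lies on the quadrant of radius one.
    print(r, round(r_xt, 3), round(r_xe, 3), round(r_xt ** 2 + r_xe ** 2, 6))
```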

14. Summary 

The material in this chapter has been based upon the following
definitions:

1. Definitions, from elementary statistics, of the mean, standard
deviation, and correlation,

    M_x = \frac{\sum X}{N},

    s_x^2 = \frac{\sum x^2}{N},

    r_{xy} = \frac{\sum xy}{N s_x s_y}.

2. Definition of the relationship between observed, true, and error
score,

(1)    X_i = T_i + E_i

or its equivalent

(14)    x = t + e.

3. Definition of random errors,

(2)    M_e = 0,

(4)    r_{E_1 E_2} = 0,

(3)    r_{te} = 0.

4. Definition of parallel tests,

(19)    T_{ig} = T_{ih},

(20)    s_{e_g} = s_{e_h}.

From these definitions the following equations have been derived: 

1. Since the true score for each individual is the same on parallel 
tests, it follows that 

(21)    M_{T_g} = M_{T_h},

(22)    s_{T_g} = s_{T_h}.


2. From the two equations defining parallel tests, it is shown that
the observed scores on parallel tests must satisfy the following
characteristics:

(27)    M_{X_g} = M_{X_h},

(26)    s_{X_g}^2 = s_{X_h}^2,

(37)    r_{x_1 x_2} = r_{x_1 x_3} = r_{x_2 x_3} = \cdots = r_{x_3 x_4}.


3. The variances of true, observed, and error scores are related
by the equation

(17)    s_x^2 = s_t^2 + s_e^2

or

(18)    s_x = \sqrt{s_t^2 + s_e^2}.


4. It has been shown that the mean true score and the standard
deviation of both true and error scores can be expressed in terms
of observable quantities as follows:

(9)    M_t = M_x,

(39)    s_t = s_x \sqrt{r_{x_g x_h}}    (true standard deviation),

(42)    s_e = s_x \sqrt{1 - r_{x_g x_h}}    (error of measurement).

5. It has been shown that the correlation of observed scores with
true and error scores can be expressed in terms of observable
quantities as follows:

(54)    r_{xt} = \sqrt{r_{x_g x_h}}    (index of reliability),

(60)    r_{xe} = \sqrt{1 - r_{x_g x_h}}.

Among the foregoing quantities, the error of measurement is the most 
significant and the most generally useful. Illustrations have been given 
of how the error of measurement can be used to assign limits between 
which the person’s true score is very likely to be found. 

Problems 

1. Give the error of measurement, standard deviation of true scores, correlation 
between observed and error scores, and the index of reliability for each of the follow¬ 
ing tests: 


Test    Number of Items    Mean     Standard Deviation    Reliability

A             50           100.0          15.0                .91
B            100           211.6          25.7                .84
C             80            57.4          11.3                .78
D            700           361.9          76.5                .87
E            200           127.4          21.9                .76


2. Assume a normal distribution of error scores. Give the true score limits (approx¬ 
imately 0.3 per cent level) for persons making each of the following scores: 

(a) A score of 115 on test A. 

(b) A score of 211 on test B. 

(c) A score of 31 on test C. 

(d) A score of 500 on test D.

(e) A score of 100 on test E. 

3. For each test A through E what is the minimum difference between the observed 
scores of two individuals that will give reasonable assurance that they do not have 
the same true scores? 
Fundamental Equations Derived
from a Definition of True Score 


1. Definition of true score 

Again let us begin with basic equation 1 of Chapter 2 (X = T + E) 
and put it in the form 

(1) E = X — T. 

In other words, the error is defined as the difference between true score 
and observed score. Then, if “true score” is defined, the error can be 
determined. True score is defined as

(2)    T_i = \lim_{K \to \infty} \frac{\sum_{g=1}^{K} X_{ig}}{K},

that is, the true score for a given person (i) is the limit that the average 
of his scores on a number of tests approaches, as the number of parallel 
tests (K) increases without limit. These tests are designated by the 
subscript g, which varies from 1 to K. 

2. Definition of parallel tests 

In addition to the definition of true score, a definition of parallel 
tests is needed. Instead of defining parallel tests in terms of true and 
error score (as we did in the chapter immediately preceding), and then 
deriving the observed score characteristics of parallel tests, we shall 
define parallel tests this time in terms of observable characteristics. 

Again we may begin with the basic definition that tests are parallel 
if “it makes no difference which one is used.” Clearly such a definition 
requires that the means and standard deviations of a given group of 
subjects be equal regardless of which test is used. Expressing this in 
equational form, we have 

(3)    \bar{X}_1 = \bar{X}_2 = \bar{X}_3 = \bar{X}_4 = \cdots = \bar{X}_K

and

(4)    s_1 = s_2 = s_3 = s_4 = \cdots = s_K.

Also it is clear that

(5)    r_{12} = r_{13} = r_{14} = r_{23} = r_{24} = r_{34},

since if one of these correlations were higher than the others, the “pre¬ 
diction” from one of these tests to the other would be higher than could 
be obtained by using other combinations of two tests. “It makes no 
difference” which test is used only if all the intercorrelations between 
parallel forms are equal. In order to be useful, this correlation must be 
high for parallel tests. That is, the standard deviation of the true
scores must be considerably greater than the standard deviation of the
error scores.

The basic definitions used in this chapter are given in equations 1 to 5,
inclusive. From these equations we can, without any further assumptions,
derive all the equations given in the preceding chapter.

3. Determination of mean true score 

The mean true score is obtained by summing equation 2 over the
number of cases (N) and dividing by N.

(6)    M_T = \frac{\sum_{i=1}^{N} \sum_{g=1}^{K} X_{ig}}{NK}    (K \to \infty).

Since the order of summation makes no difference, we may substitute
M_g for \sum_{i=1}^{N} X_{ig} (1/N) and write

(7)    M_T = \frac{\sum_{g=1}^{K} M_g}{K}    (K \to \infty).

By the assumption of equation 3 all the means are equal for parallel
tests. Therefore, in the summation, M_g or \bar{X}_g may be treated as a
constant, so that \sum M_g = K M_g. Substituting this value in equation 7
gives

(8)    M_T = M_g = \bar{X}_g.

The mean observed score is equal to the mean true score. 
Equation 8 is based only on the definition of equation 2 and 
the assumption of equation 3. 
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4. Determination of true score variance 

From the definition of a standard deviation, we may write 


[Chap, 3 


Sf 2 = 


Z (7, - M t ) 2 
1 


In order to simplify the term in parentheses, we can make use of equa¬ 
tions 2 and 8, obtaining 


Ti - M t « 


- M g . 


We omit here the proviso, stated in equation 2, that the true score is 
the limit of the average of a large number of tests as “K approaches 
infinity. ,, This proviso is omitted for convenience in the following 
equations. It will be introduced explicitly again in equation 18. It 
should be noted that nothing is done in the meantime, in equations 11 
through 17, to invalidate carrying the assumption through. 

In order to simplify the notation, let us use t to designate a deviation
score and put all the right-hand term over one denominator, obtaining

(11) $t_i = \frac{\sum_{g=1}^{K} X_{ig} - KM_g}{K}.$

The numerator of equation 11 may be written as shown below and the
equation expressed as follows:

(12) $t_i = \frac{\sum_{g=1}^{K} (X_{ig} - M_g)}{K}.$

We may substitute x for X - M and write out the numerator to avoid
the summation sign, obtaining

(13) $t_i = \frac{x_{i1} + x_{i2} + x_{i3} + \cdots + x_{iK}}{K}.$

The value of t in equation 13 may be substituted for T - M in
equation 9:

(14) $s_t^2 = \frac{\sum_{i=1}^{N} (x_{i1} + x_{i2} + x_{i3} + \cdots + x_{iK})^2}{NK^2}.$

Expanding the numerator and omitting the limits on the summation
sign, since all summations are over N, we can write

(15) $s_t^2 = \frac{\sum x_1^2 + \sum x_2^2 + \sum x_3^2 + \cdots + \sum x_K^2 + \sum x_1x_2 + \sum x_1x_3 + \cdots + \sum x_Kx_{K-1}}{NK^2}.$

We may substitute $Ns_1^2$ for $\sum x_1^2$, and $Nr_{12}s_1s_2$ for $\sum x_1x_2$, and
correspondingly for other products. This substitution gives

(16) $s_t^2 = \frac{Ns_1^2 + Ns_2^2 + Ns_3^2 + \cdots + Ns_K^2 + Nr_{12}s_1s_2 + Nr_{13}s_1s_3 + \cdots + Nr_{K(K-1)}s_Ks_{K-1}}{NK^2}.$


We may divide the numerator and the denominator by N. Since from
equations 4 and 5 we see that, according to the definition of parallel
tests, all standard deviations are equal and all intercorrelations are
equal, we may substitute $s_g^2$ for each of the variances and $r_{gh}$ for each
of the intercorrelations. 1 These substitutions result in

(17) $s_t^2 = \frac{Ks_g^2 + K(K - 1)r_{gh}s_g^2}{K^2}.$

Dividing the numerator and the denominator through by K and
separating terms gives

(18) $s_t^2 = \frac{s_g^2}{K} + \frac{K - 1}{K}r_{gh}s_g^2.$

As was noted in equation 10, we neglected to specify that K should
approach infinity, as was done in equation 2 defining the true score.
If we now introduce this part of the definition of true score, the first
term of equation 18 vanishes and (K - 1)/K approaches unity, so that

(19) $s_t^2 = r_{gh}s_g^2$ (true variance).

If we take the square root of both sides of equation 19, we have

(20) $s_t = s_g\sqrt{r_{gh}}$ (true standard deviation).

Equations 19 and 20 give the variance and standard deviation
of the distribution of true scores in terms of the test reliability
and the standard deviation of the distribution of observed
scores. These equations may be derived by assuming
only equations 2, 4, and 5.

1 Note that $r_{gh}$ is the correlation between any two parallel tests (g and h), and
therefore is a reliability coefficient. It is identical with $r_{x_gx_h}$, which was used for the
reliability coefficient in Chapter 2.

It should be noted that equations 19 and 20 are the same as equations
38 and 39 given in Chapter 2 for true variance and true standard
deviation. However, in the preceding chapter these theorems were derived
from assumptions about “random errors” and a definition of “parallel
tests” in terms of true and error scores. In this chapter the same
theorems are derived from a definition of “true score” (equation 2)
and from a definition of parallel tests that depends upon observed scores.
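
The passage from equation 18 to equation 19 can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names and the sample values (a standard deviation of 15 and a reliability of .91) are assumptions chosen only for illustration.

```python
def true_variance_finite_k(s_g: float, r_gh: float, k: int) -> float:
    """Equation 18: variance of the average of k parallel tests."""
    return s_g ** 2 / k + (k - 1) / k * r_gh * s_g ** 2


def true_variance(s_g: float, r_gh: float) -> float:
    """Equation 19: the limit of equation 18 as k increases without limit."""
    return r_gh * s_g ** 2


# Illustrative values only (not taken from the text).
S_G, R_GH = 15.0, 0.91
for k in (1, 2, 10, 100, 10_000):
    print(k, true_variance_finite_k(S_G, R_GH, k))
print("limit", true_variance(S_G, R_GH))
```

With K = 1 the expression reduces to the observed variance, and as K grows it settles toward the product of reliability and observed variance, which is the content of equation 19.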


5. Correlation of true with observed scores (the index 
of reliability) 

Using the usual formula for correlation and the definition of true
score as the average of scores on a large number of parallel tests (see
equation 2), we may write

(21) $r_{xt} = \frac{\sum x_1(x_1 + x_2 + \cdots + x_K)}{NKs_1s_t} \quad (K \to \infty).$

Removing the parentheses gives

(22) $r_{xt} = \frac{\sum x_1^2 + \sum x_1x_2 + \sum x_1x_3 + \cdots + \sum x_1x_K}{NKs_1s_t} \quad (K \to \infty).$

Dividing both numerator and denominator by N gives

(23) $r_{xt} = \frac{s_1^2 + r_{12}s_1s_2 + r_{13}s_1s_3 + \cdots + r_{1K}s_1s_K}{Ks_1s_t} \quad (K \to \infty).$

Canceling $s_1$ from numerator and denominator, we have

(24) $r_{xt} = \frac{s_1 + r_{12}s_2 + r_{13}s_3 + \cdots + r_{1K}s_K}{Ks_t} \quad (K \to \infty).$

It will be noted that there are K - 1 terms involving r. Since the
sum of a series of terms is equal to the number of terms multiplied by
the average term, we may write

(25) $r_{xt} = \frac{s_1 + (K - 1)\overline{rs}}{Ks_t} \quad (K \to \infty),$

where $\overline{rs}$ equals the sum of the terms $r_{1g}s_g$ divided by K - 1.
Substituting for $s_t$ its value from equation 20, we have

(26) $r_{xt} = \frac{s_g + (K - 1)\overline{r_{gh}s_g}}{Ks_g\sqrt{r_{gh}}} \quad (K \to \infty).$


According to equations 3, 4, and 5, one of the rs products may be
substituted for the average of an infinite number. Making this
substitution, dividing numerator and denominator by $s_g$, and separating the
terms gives

(27) $r_{xt} = \frac{1}{K\sqrt{r_{gh}}} + \frac{K - 1}{K}\sqrt{r_{gh}} \quad (K \to \infty).$

If we let K approach infinity, equation 27 becomes

(28) $r_{xt} = \sqrt{r_{gh}}$ (index of reliability),

where $r_{gh}$ is the reliability of the test.

The test validity must always be less than the correlation
between observed scores and true scores, or the index of
reliability as given by equation 28.

Again it may be noted that this equation is identical with the equation 
given in the preceding chapter for the correlation between true and 
observed scores. However, it is derived from one set of assumptions 
in Chapter 2 and from another set in Chapter 3. 
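
As a rough numerical illustration of the limit leading to equation 28, the sketch below computes the correlation between one test and the average of K parallel tests. It is not taken from the text: it is derived here by using the finite-K standard deviation of equation 18 in the denominator (rather than the limiting value of equation 20 that the text substitutes), which gives the closed form $\sqrt{[1 + (K - 1)r_{gh}]/K}$. The function names and the reliability of .81 are illustrative assumptions.

```python
import math


def corr_with_mean_of_k(r_gh: float, k: int) -> float:
    """Correlation of one test with the average of k parallel tests.

    Assumption of this sketch: the finite-k standard deviation from
    equation 18 is used in the denominator, so the value is a proper
    correlation for every k and not only in the limit.
    """
    return math.sqrt((1.0 + (k - 1) * r_gh) / k)


def index_of_reliability(r_gh: float) -> float:
    """Equation 28: the limiting value as k increases without limit."""
    return math.sqrt(r_gh)


R_GH = 0.81  # illustrative reliability
for k in (1, 2, 5, 50, 5000):
    print(k, corr_with_mean_of_k(R_GH, k))
print("limit", index_of_reliability(R_GH))
```

For K = 1 the "average" is the test itself and the correlation is unity; as K increases the value falls toward the square root of the reliability.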

6. Average error 

Using equation 1 for errors, we may sum it over persons from 1 to N . 
Since all summations are over persons from 1 to N , no ambiguity will 
arise if subscripts are omitted and the limits of summation are not 
indicated. We write 

(29) $\sum E = \sum X - \sum T.$

Dividing equation 29 by N, we have

(30) $M_e = M_x - M_t.$

Substituting equation 8, we see that

(31) $M_e = 0.$

The average error is zero. 

7. The standard deviation of error scores (the error of
measurement)

Using the usual formula for standard deviation, we may write from
equation 1, noting that the differences of deviation scores equal the
differences of gross score,

(32) $s_e^2 = \frac{\sum (x - t)^2}{N}.$

Expanding the numerator, we have

(33) $s_e^2 = \frac{\sum x^2 + \sum t^2 - 2\sum xt}{N}.$

Dividing through by N gives

(34) $s_e^2 = s_x^2 + s_t^2 - 2r_{xt}s_xs_t.$

Substituting equations 20 and 28 in equation 34 gives

(35) $s_e^2 = s_x^2 + s_g^2r_{gh} - 2\sqrt{r_{gh}}\,s_xs_g\sqrt{r_{gh}}.$

Combining terms in equation 35 and simplifying, we get 1

(36) $s_e^2 = s_x^2 - s_x^2r_{gh},$
or

(37) $s_e^2 = s_x^2(1 - r_{gh})$ (variance of the errors of measurement).

Taking the square root of both sides of equation 37, we have the usual
formula for the error of measurement,

(38) $s_e = s_x\sqrt{1 - r_{gh}}$ (error of measurement),

where $s_x$ is the standard deviation of the distribution of gross scores, and
$r_{gh}$ is the test reliability.

Equations 37 and 38 are the same as equations 41 and 42 in
Chapter 2. In this chapter the error of measurement is derived
from the assumptions of equations 1, 2, 4, and 5.

8. Relationship between true and error variance 

By adding equations 37 and 19 we obtain

(39) $s_t^2 + s_e^2 = r_{gh}s_g^2 + s_g^2(1 - r_{gh}).$

If we factor out $s_g^2$, we have

(40) $s_t^2 + s_e^2 = s_g^2(r_{gh} + 1 - r_{gh}),$

which equals

(41) $s_g^2 = s_t^2 + s_e^2.$

The true variance plus the error variance is equal to the
observed variance.

1 Note that for parallel tests $s_x$, $s_g$, and $s_h$ may be used interchangeably.

Again it may be pointed out that the last three theorems involving
error variance were proved from a different set of assumptions in
Chapter 2.
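
Equations 19, 37, 38, and 41 can be put together in a few lines of code. The following Python sketch is an illustration only; the function name, the dictionary keys, and the sample values (standard deviation 15, reliability .91) are assumptions made here and do not appear in the text.

```python
import math


def decompose_variance(s_g: float, r_gh: float) -> dict:
    """Split observed variance into true and error parts.

    Uses equation 19 for true variance and equation 37 for error
    variance; equation 41 says the two parts add up to the observed
    variance.
    """
    if not 0.0 <= r_gh <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    observed = s_g ** 2
    true = r_gh * observed               # equation 19
    error = observed * (1.0 - r_gh)      # equation 37
    return {
        "observed": observed,
        "true": true,
        "error": error,
        "s_e": math.sqrt(error),         # equation 38
    }


parts = decompose_variance(15.0, 0.91)   # illustrative values
print(parts)
```

The check `parts["true"] + parts["error"] == parts["observed"]` (up to rounding) is simply equation 41.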


9. Correlation of errors and observed scores 

Using the usual formula for correlation, we can write, from
equation 1,

(42) $r_{ex} = \frac{\sum x(x - t)}{Ns_gs_e}.$

Expanding and factoring out N as before gives

(43) $r_{ex} = \frac{s_g^2 - r_{xt}s_gs_t}{s_gs_e}.$

Factoring out $s_g$ and substituting from equations 20, 28, and 38 gives

(44) $r_{ex} = \frac{s_g - \sqrt{r_{gh}}\,s_g\sqrt{r_{gh}}}{s_g\sqrt{1 - r_{gh}}}.$

Dividing through by $s_g$ again, we have

(45) $r_{ex} = \frac{1 - r_{gh}}{\sqrt{1 - r_{gh}}},$

which is equivalent to

(46) $r_{ex} = \sqrt{1 - r_{gh}},$

where $r_{gh}$ is the test reliability.

Equation 46 is identical with equation 60 of Chapter 2. It
is equal to the standard error of measurement when the test
standard deviation is unity. Since the standard deviation of
standard scores is equal to unity (see Chapter 19), equation
46 is sometimes referred to as the error of measurement for
a standard score.


10. The correlation between true score and error score

This correlation is derived by exactly the same procedure as in the
last section. We first write

(47) $r_{et} = \frac{\sum t(x - t)}{Ns_ts_e}.$

Expanding equation 47 and dividing by N, we have

(48) $r_{et} = \frac{r_{tx}s_ts_x - s_t^2}{s_ts_e}.$

Substituting the values of $r_{tx}$, $s_t$, and $s_t^2$ from equations 19, 20, and 28,
we have

(49) $r_{et} = \frac{s_g^2r_{gh} - s_g^2r_{gh}}{s_ts_e} \quad (s_g = s_x).$

It can be seen that the numerator is equal to zero, and therefore

(50) $r_{et} = 0.$

The correlation between error scores and true scores is zero.


This theorem, which is proved from the definitions of true score and 
parallel tests, is identical with one of the assumptions made in defining 
random error in Chapter 2. 
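
The three intercorrelations of equations 28, 46, and 50 can be illustrated with a small simulation built on this chapter's definition of true score. The sketch below is not from the text. It assumes that each person has a hypothetical latent ability, that scores on K parallel tests are that ability plus independent normal noise, and that the "true score" is approximated by the average of a finite number of tests (40 here), so the agreement with the limiting formulas is only approximate. All names and parameter values are illustrative.

```python
import math
import random


def pearson(a, b):
    """Product-moment correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(n_persons=5000, k_tests=40, reliability=0.8, seed=1):
    """Simulate parallel tests and return r_et, r_ex, and r_tx.

    Assumption of this sketch: unit observed variance, with the
    hypothetical ability carrying a share equal to the reliability.
    """
    rng = random.Random(seed)
    sd_ability = math.sqrt(reliability)
    sd_noise = math.sqrt(1.0 - reliability)
    observed, true = [], []
    for _ in range(n_persons):
        ability = rng.gauss(0.0, sd_ability)
        scores = [ability + rng.gauss(0.0, sd_noise) for _ in range(k_tests)]
        observed.append(scores[0])            # score on the first test
        true.append(sum(scores) / k_tests)    # equation 2 with finite K
    error = [x - t for x, t in zip(observed, true)]  # equation 1
    return {
        "r_et": pearson(error, true),
        "r_ex": pearson(error, observed),
        "r_tx": pearson(true, observed),
    }


results = simulate()
print(results)
```

With these settings the simulated values should lie close to 0, $\sqrt{1 - r_{gh}}$, and $\sqrt{r_{gh}}$ respectively, apart from sampling fluctuation and the small effect of using a finite number of tests.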


11. Summary 

In this chapter we began with a definition of true score and a definition 
of parallel tests in terms of observed scores. From these definitions the 
fundamental theorems of test theory were derived. By comparison we 
see that they are the same as those in Chapter 2, except that a different 
set of equations was chosen for the assumptions. 

First error was defined as the difference between the observed and 
the true score: 

(1) E = X — T. 

True score of a given person was defined as the average of his scores
on a number of parallel tests, as this number increases without limit.

(2) $T_i = \lim_{K \to \infty} \frac{\sum_{g=1}^{K} X_{ig}}{K}.$

Parallel tests were defined as tests with equal means, standard
deviations, and intercorrelations:

(3) $\bar{X}_1 = \bar{X}_2 = \cdots = \bar{X}_K,$

(4) $s_1 = s_2 = \cdots = s_K,$

(5) $r_{12} = r_{13} = \cdots = r_{K(K-1)}.$

Another way to state the definitions of equations 3, 4, and 5 is to
say that for a group of parallel tests we assume that a reasonable
approximation is obtained if we substitute the mean of a single test for the mean
of the means of all tests. In like manner, the variance of a single test
furnishes a reasonable approximation to the average of the variances of
an infinite number of tests; and the covariance of a single pair of tests
furnishes a reasonable approximation to the average of the covariances
of an infinite number of parallel tests. This set of assumptions is
necessary in order to enable us to substitute actual values in the formulas
derived.

The true score mean, variance, and standard deviation were found
to be

(8) $M_T = M_g,$

(19) $s_t^2 = r_{gh}s_g^2,$

(20) $s_t = s_g\sqrt{r_{gh}}.$

It was proved that the error score mean, variance, and standard
deviation were

(31) $M_e = 0,$

(37) $s_e^2 = s_g^2(1 - r_{gh}),$

(38) $s_e = s_g\sqrt{1 - r_{gh}}$ (error of measurement).

The intercorrelations of observed, true, and error scores were shown to be

(50) $r_{te} = 0,$

(46) $r_{ex} = \sqrt{1 - r_{gh}},$

(28) $r_{tx} = \sqrt{r_{gh}}$ (index of reliability).

It was also shown that

(41) $s_g^2 = s_t^2 + s_e^2.$

From the definition of true score we see that the score is the same
regardless of the particular few tests a person takes, and consequently
the true variance for a number of parallel tests is the same. Since the
observed and the true variances are the same for each of a set of parallel
tests, it follows that the error variance of each of the tests should be the
same. That is, we may write

$T_{ig} = T_{ih},$

$s_{e_g} = s_{e_h}.$

These characteristics of parallel tests were assumed in Chapter 2, and
those given in equations 3, 4, and 5 were derived.

The equations derived in this chapter are identical with those
derived in Chapter 2. It has been shown that the fundamental equations
of test theory can be derived either from a definition of error and a
definition of parallel tests in terms of true and error score or from
a definition of true score and a definition of parallel tests in terms of
observed scores.


Problems 

1. Write the equation corresponding to each of the following assumptions:

A. The true score is the difference between observed score and error score.

B. The average error score for a large group of persons is zero.

C. True scores and error scores are uncorrelated.

D. Error scores on one form of a test are uncorrelated with those on another form.

E. Parallel tests have identical means.

F. Parallel tests have identical standard deviations.

2. Using only the necessary ones of the foregoing assumptions (and no additional
assumptions), derive each of the following:

(a) What is the value of the average true score?

(b) The observed variance is the sum of true and error variance.

(c) Find the value of the true variance.

(d) Find the value of the error variance.

(e) Find the correlation between observed and error scores.

(f) Find the correlation between true and observed scores.

Note: Work each of the foregoing six derivations independently. At the beginning 
of each of the six derivations, give the assumptions from the list (A-F) that are 
essential for that particular derivation. 

3. By using only equations 1, 20, and 28, prove that the correlation between errors
on test g and errors on test h is zero if g and h are parallel tests.


Errors of Measurement,
Substitution, and Prediction


1. Introduction 

The commonly used and most generally useful measure of “test error” 
is the “error of measurement” defined in Chapters 2 and 3. It is the 
standard deviation of the distribution of differences between observed 
score and true score. However, there are other possible measures of 
test error that are useful for certain purposes. These measures will be 
considered in this chapter. The four different types of error are defined 
as follows: 

$e = x - t,$

$\epsilon = t - r_{xx}x,$

$d = x_1 - x_2,$

$\delta = x_1 - r_{12}\left(\frac{s_1}{s_2}\right)x_2.$

Stated in words, these four types of error are: 


The difference between true and observed score. 

The error made in estimating true score from observed score. 

The difference between two observed scores on parallel tests. 

The error made in predicting one observed score from the score on a 
parallel test. 

These different measures of error are presented by Kelley (1927). 

The third type of error listed above is the simplest and most direct 
measure of error, so let us consider it first. 


2. Error of substitution 1

Here we define the error as the difference between two observed scores
on parallel tests, that is,

(1) $d = x_1 - x_2.$

This definition of error applies if we are interested in considering the 
possible differences between the results of one investigator using a given 
test and another investigator using a parallel form of the same test. 

In order to obtain the standard deviation of these difference scores,
which is the standard error of substitution, we get the standard deviation
in the usual way, by squaring, summing, and dividing by N. This gives

(2) $\frac{\sum d^2}{N} = \frac{\sum (x_1 - x_2)^2}{N}.$

We may substitute $s_d^2$ for the left-hand member and expand the
right-hand member, obtaining

(3) $s_d^2 = \frac{\sum x_1^2}{N} + \frac{\sum x_2^2}{N} - \frac{2\sum x_1x_2}{N}.$

Since the first two terms are variances and the last one a covariance,
we may write

(4) $s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1s_2.$

Since standard deviations of parallel tests are equal, we may write

(5) $s_d^2 = 2s_1^2(1 - r_{12}),$

where $s_d$ is the error of substitution or the error made in substituting a
score on one test for a score on a parallel form,

$s_1^2$ is the observed variance of the test,

$r_{12}$ is the test reliability, or the correlation of two parallel forms.

Taking the square root of both sides of equation 5 gives

(6) $s_d = s_1\sqrt{2(1 - r_{12})}.$

The error made in substituting a score on one test for a score
on a parallel form is given by equation 6. This is also the
standard error of a difference for the case in which the two
standard deviations are alike.
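
The step from equation 2 to equation 4 is an algebraic identity, and it can be checked on any pair of score lists. The Python sketch below is illustrative only: the two short score lists are made-up numbers (not data from the text, and not strictly parallel, since their standard deviations differ), and all function names are assumptions. Equation 4 holds for any two lists; equation 6 applies only when the two standard deviations are equal.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation with N in the denominator, as in the text."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(a, b):
    m_a, m_b = mean(a), mean(b)
    cov = sum((x - m_a) * (y - m_b) for x, y in zip(a, b)) / len(a)
    return cov / (sd(a) * sd(b))


def sd_of_differences(a, b):
    """Direct computation: standard deviation of d = x1 - x2."""
    return sd([x - y for x, y in zip(a, b)])


def substitution_error_general(s_1, s_2, r_12):
    """Equation 4 (square root taken): no equality of s_1 and s_2 assumed."""
    return math.sqrt(s_1 ** 2 + s_2 ** 2 - 2 * r_12 * s_1 * s_2)


def substitution_error(s_1, r_12):
    """Equation 6: parallel tests, so the two standard deviations are equal."""
    return s_1 * math.sqrt(2 * (1 - r_12))


form_1 = [2, 4, 6, 8, 10]   # hypothetical scores on one form
form_2 = [3, 3, 7, 9, 8]    # hypothetical scores on a second form

direct = sd_of_differences(form_1, form_2)
from_eq4 = substitution_error_general(
    sd(form_1), sd(form_2), correlation(form_1, form_2)
)
print(direct, from_eq4)
print(substitution_error(15.0, 0.91))
```

The first two printed values agree because equation 4 is only a rearrangement of equation 2; the last line applies equation 6 to an illustrative standard deviation and reliability.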

1 This term was introduced by M. W. Richardson while teaching at the University
of Chicago.

3. Error in estimating observed score

Instead of saying that we substitute the score on one test for the
score on a parallel form, we can use the ordinary “error of estimate”
and compute the minimal error that can be made in predicting the score
on one form from the other form by using the least squares regression
equation. As indicated in the introduction to this chapter, we write

(7) $\delta = x_1 - r_{12}\left(\frac{s_1}{s_2}\right)x_2.$

As before, we write the variance of $\delta$, noting that since $s_1 = s_2$, the
term in parentheses is unity and may be omitted.

(8) $\frac{\sum \delta^2}{N} = \frac{\sum (x_1 - r_{12}x_2)^2}{N}.$

Expanding as before and substituting $s_\delta^2$ for the left-hand term, we have

(9) $s_\delta^2 = \frac{\sum x_1^2}{N} + \frac{r_{12}^2\sum x_2^2}{N} - \frac{2r_{12}\sum x_1x_2}{N}.$

Equation 9 can be rewritten as

(10) $s_\delta^2 = s_1^2 + r_{12}^2s_2^2 - 2r_{12}^2s_1s_2.$

Since the variances of parallel tests are equal we may write

(11) $s_\delta^2 = s_1^2(1 - r_{12}^2).$

Taking the square root, we have the final equation

(12) $s_\delta = s_1\sqrt{1 - r_{12}^2},$

which is the usual standard error of estimate.

Equation 12 gives the error made when the regression
equation is used to estimate the scores on one test from scores on
a parallel test.

It should be noted that equation 12 is the correct one to use if the
regression equation has been used to estimate scores on a parallel form
and we wish to determine the error involved. Equation 6 is the correct
one to use if scores on one test are assumed to be equal to those on a
parallel test without use of any regression equation.


4. The error of measurement 

The error of measurement can be interpreted in several ways, which
will be considered in the next chapter. We shall consider here only one
interpretation, namely, the error of measurement is the error made in
substituting the observed score for the true score. We wish to assign
each person a true score, and instead we assign the observed score. The
difference between these two scores is the error of measurement.
Derivations of this error of measurement have been given in preceding
chapters; however, one will be repeated here.

One of the basic assumptions of test theory is that

(13) $x = t + e.$

Squaring and summing both sides gives

(14) $\sum x^2 = \sum (t + e)^2.$

Expanding the term in parentheses, we have

(15) $\sum x^2 = \sum t^2 + \sum e^2 + 2\sum te.$

Dividing through by N we can write this equation in terms of variances
and covariances as follows:

(16) $s_x^2 = s_t^2 + s_e^2 + 2r_{te}s_ts_e.$

Since one of the fundamental assumptions in the definition of error is
that it correlates zero with true score, we may omit the last term and
write

(17) $s_x^2 = s_t^2 + s_e^2.$

From the previous discussion of true variance we see that $r_{xx}s_x^2$ is equal
to the true variance. 1 Therefore, substituting this for $s_t^2$ and solving
for $s_e^2$, we have

(18) $s_e^2 = s_x^2 - s_x^2r_{xx},$

which may be written

(19) $s_e^2 = s_x^2(1 - r_{xx}).$

Taking the square root of both sides, we have

(20) $s_e = s_x\sqrt{1 - r_{xx}},$

which is the formula previously given for the error of measurement.

The error of measurement is the error made in substituting
the observed score for the true score.

1 In Chapter 2, the symbol $r_{x_gx_h}$ was used for the reliability coefficient to emphasize
the fact that it was the correlation between forms g and h of a test. Similarly, in
Chapter 3 the subscripts g and h were retained so that a sum of reliability coefficients
could be indicated. When we are not emphasizing the correlation among various
parallel forms, it is convenient to designate the reliability coefficient by repeating a
subscript. Thus $r_{xx}$ is the reliability of test x; $r_{yy}$, the reliability of test y; $r_{11}$ and $r_{22}$,
the reliability of tests 1 and 2, respectively; etc.

5. Error in estimating true score 

In this chapter we have considered the error made in substituting the
score on one test for the score on a parallel test. Also we have shown
that the error made is smaller if we use the score on the first test to
predict the score on the second test and then obtain the error of
estimate. The error of measurement can be interpreted in several ways;
one way is to regard it as the error made in substituting the observed
score for the true score. Also it is possible to ask what error is made in
attempting to predict the true score from the observed score. In order
to obtain this error, we set up the usual prediction equation:

(21) $\hat{t} = r_{xt}\frac{s_t}{s_x}x,$

where $\hat{t}$ is the predicted value of the true score. The difference between
the actual and predicted true score is the error. This may be written

(22) $\epsilon = t - r_{xt}\frac{s_t}{s_x}x.$

The standard deviation of $\epsilon$ or the usual error of estimate then is

(23) $s_\epsilon = s_t\sqrt{1 - r_{xt}^2}.$

Since $s_t = s_x\sqrt{r_{xx}}$ and $r_{xt} = \sqrt{r_{xx}}$, as was shown in Chapters 2 and 3,
we may write

(24) $s_\epsilon = s_x\sqrt{r_{xx}}\sqrt{1 - r_{xx}}.$

The error made in using the best fitting regression equation
to predict the true score from the observed score is given by
equation 24.

It may seem paradoxical at first to note that $s_\epsilon = 0$ if $r_{xx} = 0$.
However, we see in equation 39 of Chapter 2 that $s_x\sqrt{r_{xx}}$ is the standard
deviation of the true scores. Thus $s_\epsilon$ is always some fraction of the true
standard deviation. By inspection of the equation $s_t = s_x\sqrt{r_{xx}}$ we see
that if the reliability is zero, $s_t = 0$; hence any fraction of it also is zero.
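
The behavior just described, with the error in estimating true score vanishing at both ends of the reliability scale, can be seen by tabulating the expression $s_x\sqrt{r_{xx}(1 - r_{xx})}$. The Python sketch below is illustrative only; the function name and the choice of a unit standard deviation are assumptions made here.

```python
import math


def error_in_estimating_true_score(s_x: float, r_xx: float) -> float:
    """Standard error of estimating true score: s_x * sqrt(r_xx * (1 - r_xx))."""
    return s_x * math.sqrt(r_xx * (1.0 - r_xx))


# Tabulate the error, in units of the observed standard deviation,
# for reliabilities 0.0, 0.1, ..., 1.0.
ratios = {r / 10: error_in_estimating_true_score(1.0, r / 10) for r in range(11)}
for reliability, ratio in ratios.items():
    print(reliability, round(ratio, 4))
```

The tabulated values are zero at a reliability of 0 and of 1, and largest (one-half of the observed standard deviation) at a reliability of .5.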

6. Comparison of four errors 

The relative magnitude of these four errors is shown in Figure 1.
The relationship can easily be seen if the expression $1 - r^2$ is factored
into $(1 - r)(1 + r)$.

[Figure 1. Comparison of errors of measurement, prediction, and substitution. Horizontal axis: r, from .0 to 1.0; vertical axis: error ÷ standard deviation.]

The four errors arranged in order from smallest
to largest may be written as follows:

$s_\epsilon = s_x\sqrt{1 - r_{xx}}\sqrt{r_{xx}},$

$s_e = s_x\sqrt{1 - r_{xx}}\sqrt{1},$

$s_\delta = s_x\sqrt{1 - r_{xx}}\sqrt{1 + r_{xx}},$

$s_d = s_x\sqrt{1 - r_{xx}}\sqrt{2}.$

These terms are written so that they are identical except for the last
factor. Since the reliability coefficient must be between zero and one,
we shall always have $r_{xx} < 1 < 1 + r_{xx} < 2$. Thus for any given set
of data we shall always find that $s_\epsilon < s_e < s_\delta < s_d$.
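
The ordering of the four errors can be verified for any particular standard deviation and reliability. The following Python sketch is not part of the original text; the function name, the dictionary keys, and the sample values (standard deviation 15, reliability .91) are illustrative assumptions.

```python
import math


def four_errors(s_x: float, r_xx: float) -> dict:
    """Return the four standard errors, written with their common factor."""
    if not 0.0 <= r_xx <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    common = s_x * math.sqrt(1.0 - r_xx)
    return {
        "estimating_true_score": common * math.sqrt(r_xx),
        "measurement": common,
        "estimating_observed_score": common * math.sqrt(1.0 + r_xx),
        "substitution": common * math.sqrt(2.0),
    }


errors = four_errors(15.0, 0.91)   # illustrative values
for name, value in errors.items():
    print(f"{name:28s} {value:.4f}")
```

Because the four expressions differ only in their last factor, the dictionary is already in ascending order for any reliability strictly between zero and one.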

7. Summary

Four different sorts of test error have been considered. Two of them
are what might be called “errors of substitution”; one is the error in¬ 
volved in substituting a score on one test for a score on a parallel test, 
the other is the error involved in substituting the observed score for 
the true score. This latter error is the most commonly used one and is 
termed the error of measurement. 

Corresponding to each of these errors of substitution is an “error of 
estimate.” One is the ordinary error of estimate, which is the error 
made in estimating score on one test from score on a parallel test. The 
other is the error made in estimating the true score from observed score. 
This latter one is almost never used, since no practical advantage is 
gained from using the regression equation to estimate true scores. 

The equations for these four errors are

(1) $d = x_1 - x_2,$

(7) $\delta = x_1 - r_{xx}x_2,$

(13) $e = x - t,$

(22) $\epsilon = t - r_{xx}x,$

while the corresponding standard deviations are

(6) $s_d = s_x\sqrt{2(1 - r_{xx})},$

(12) $s_\delta = s_x\sqrt{1 - r_{xx}^2},$

(20) $s_e = s_x\sqrt{1 - r_{xx}},$

(24) $s_\epsilon = s_x\sqrt{r_{xx}(1 - r_{xx})}.$



Problems


1. Give each of the four error indices for each of the following five tests. 


Test   Number of Items   Mean   Standard Deviation   Reliability
A            70            40         10.0               .84
B            50            35          5.0               .72
C           400           251         55.7               .93
D           200           110         25.8               .89
E           150            60         15.2               .78


2. Read the Douglass (1934) and Monroe (1934) articles, and write a brief résumé
and criticism of the material.

3. Forms M and L of the Stanford Binet are parallel forms. One investigator
uses form M in a school, and later another investigator uses form L on the same
group. An enterprising student calculates the score differences to verify the formula
for the standard error of measurement $(1 - r)\sigma_x^2$.

(a) Will he verify this formula?

(b) What error measure would be verified for score differences?

(c) If the verification were not precise, what explanation would be reasonable?

4. One investigator used one form of an arithmetic test (a brief 5-minute form)
in his investigation. Another investigator used another 2-hour arithmetic test and
divided the total number solved by 24 to make results comparable with the 5-minute
form. The standard deviation of the differences between these scores would be
given by what formula?

5. Form M of the Binet test has been in constant use in a given clinic. The
director orders that form L of the Binet test be used in the future. For uniformity
all old M scores are to be expressed in terms of the new form.

(a) What method will accomplish this with minimum error?

(b) What formula gives this error?

6. Under what condition will the standard error of measurement equal zero? 

7. Under what condition will the standard error of measurement equal the stand¬ 
ard deviation of true scores? 

8. Under what condition will the standard error of measurement equal the 
standard deviation of the test scores? 

9. Under what condition is the standard error of estimate equal to zero? 

10. Under what conditions is the standard error of estimate equal to the standard
deviation of the variable being predicted?

11. Under what condition is the standard error of estimate equal to the standard
deviation of the variable used for prediction?

12. Mr. A obtains a score of 117 on test E (problem 1) and Mr. B obtains a score
of 95 on test E.

(a) What is the standard error of Mr. A’s test score?

(b) What is the standard error of Mr. B’s test score?

(c) What upper and lower limits would be assigned to Mr. A’s true score at the
0.3 per cent level?

(d) What upper and lower limits would be assigned to Mr. B’s true score at the
0.3 per cent level?

13. Study the material given by Bradford (1940). Comment on his results.

14. Study the equation for the reliability of a standard score given by Dickey
(1930), and comment on this concept of error.


Various Interpretations
of the Error of Measurement


1. Error of measurement and error of substitution 

Having in the previous chapter shown the difference between several 
different types of “error,” we shall now consider more intensively the 
“error of measurement.” Several alternative derivations or “interpre¬ 
tations” of this quantity will be given in order to show more clearly 
its properties and its meaning. 

The error of measurement has already been derived as the standard
deviation of the difference between true and observed score. However,
the fact that this formula ($s_x\sqrt{1 - r}$) involves the expression $1 - r$
instead of $1 - r^2$, as does the error of estimate, usually provokes some
inquiry such as, “Why don’t you use $1 - r^2$ in the error of
measurement?” It may be in order here to derive the error of measurement in
some other ways to show the nature of the difference between it and
the error of estimate. As shown in equation 7 of Chapter 4, the entire
amount of the error made in predicting is called the “error of estimate.”
Also by inspecting equation 1 of Chapter 4, we see that the entire amount
of the difference between the score on one test and on the other test is
charged to the “error.” Let us see what happens when this difference
($x_1 - x_2$) is charged partly to one test and partly to the other.

If we assume that the error is partly in $x_1$ and partly in $x_2$ and that
these two errors are uncorrelated, we obtain the following:

(1) $e_1 + e_2 = d.$

Squaring and summing, we find that

(2) $\sum e_1^2 + \sum e_2^2 + 2\sum e_1e_2 = \sum d^2.$

If we divide by N,

(3) $s_{e_1}^2 + s_{e_2}^2 + 2r_{e_1e_2}s_{e_1}s_{e_2} = s_d^2,$

and if we assume that the error variance for one test is equal to that for
the other test and that errors correlate zero,

(4) $2s_e^2 = s_d^2.$

Substituting from equation 5 of Chapter 4, we obtain

(5) $s_e^2 = s_1^2(1 - r_{12}),$

which is the square of the error of measurement, as previously derived
from the definition $e = x - t$.

If the difference between two parallel tests is assumed to be
divided into two equal and uncorrelated parts, the standard
deviation of each of these parts is given by the usual formula
for the standard error of measurement.

The error of estimate, which uses the term $(1 - r^2)$, is a measure of
the total error made in predicting score on one test from score on another
test by using the least squares regression line for prediction.

2. Error of measurement as an error of estimate 

Let us consider another derivation of the error of measurement in
order to aid in showing “why” we use $(1 - r)$ instead of $(1 - r^2)$. The
error of measurement will be shown to be the error of estimate obtained
from the regression of observed score on true score. We may begin
with the ordinary regression equation,

(6) $\hat{x} = r_{xt}\frac{s_x}{s_t}t.$

Obtaining the error of estimate in the usual way, we see that it may be
written

(7) $s_{e_{x \cdot t}} = s_x\sqrt{1 - r_{xt}^2}.$

Since $r_{xt} = \sqrt{r_{xx}}$ (see Chapters 2 and 3), we may also write

(8) $s_{e_{x \cdot t}} = s_x\sqrt{1 - r_{xx}},$

which is the error of measurement.

The error of estimate derived from the regression of observed
upon true score is the same as the error of measurement.

3. Error of measurement as interaction between persons
and tests

Those students who are acquainted with elementary analysis of 
variance methods, involving a first-order interaction—see Lindquist 
(1940) or Fisher (1946)—will find that the following material aids in 
understanding the error of measurement. Those unacquainted with 
analysis of variance should omit the remainder of this chapter. For 
additional material on the relation between analysis of variance and the 
error of measurement, see Hoyt (1941), Jackson (1939), (1940b), and
Kaitz (1945a).

Let us consider first the case of two tests, designated as x and y. The
average of the two scores (x + y)/2 is designated as a. We have thus
the following matrix of scores for N persons:

$$\begin{array}{cccccll}
x_1 & x_2 & x_3 & \cdots & x_N & M_x & \text{first score} \\
y_1 & y_2 & y_3 & \cdots & y_N & M_y & \text{second score} \\
a_1 & a_2 & a_3 & \cdots & a_N & & \text{average of first and second scores.}
\end{array}$$

With no loss of generality, we can assume that the total mean ($\sum x + \sum y$)
is zero; then the sum of squares due to tests is

$N(M_x^2 + M_y^2).$

The sum of squares due to persons is

$2\sum a^2,$

and the total sum of squares is

$\sum x^2 + \sum y^2.$

Thus the sum of squares due to interaction is

$\sum x^2 + \sum y^2 - 2\sum a^2 - N(M_x^2 + M_y^2).$

If, in the foregoing expression, we substitute (x + y)/2 for a and 
write I 2 for the sum of squares due to interaction, we have 

(9) I 2 = Zx 2 + Zy 2 - —- X + -— - N(M X 2 + M 2 ). 

4 

Expanding the third term gives 

Zx 2 Zv 2 

(10) I 2 = Zz 2 + Zy 2 -— - Zxy - N(M X 2 + M 2 ). 

2 2 
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Combining similar terms, and adding and subtracting NM Z M V , we have 
Equation (11) 

9 2x 2 V 

I 2 - — + — - 2xy + NM x M y - N{M 2 + M 2 ) - NM x M y .. 

2 2 


By dividing the term in parentheses into two equal parts and writing 
them separately, we have 


, x 9 Sx 2 — NM 2 Xy 2 - NM 2 

(12) J 2 ---+ —---- - (2X2/ ~ NM X M V ) 


- - (M x 2 + My 2 + 2M x My). 
2 


We obtain the variance due to interaction by dividing the sum of squares 
due to interaction by the degrees of freedom (N − 1). Thus we obtain 

(13)   $\dfrac{I^2}{N-1} = \dfrac{\Sigma x^2 - N M_x^2}{2(N-1)} + \dfrac{\Sigma y^2 - N M_y^2}{2(N-1)} - \dfrac{\Sigma xy - N M_x M_y}{N-1} - \dfrac{N}{2(N-1)}(M_x + M_y)^2.$

Since $M_x = -M_y$, the last term in equation 13 vanishes. The other 
terms are equal to variances and covariances so that we can rewrite the 
equation as follows: 

(14)   $\dfrac{I^2}{N-1} = \dfrac{s_x^2 + s_y^2}{2} - r_{xy} s_x s_y.$

If $s_x = s_y$, we may put $s_x$ in place of $s_y$ and use $r_{xx}$ instead of $r_{xy}$, obtaining 

(15)   $\dfrac{I^2}{N-1} = s_x^2 - r_{xx} s_x^2,$

which is equal to 

(16)   $\dfrac{I^2}{N-1} = s_x^2(1 - r_{xx}).$

The variance due to interaction between persons and tests is 
the square of the error of measurement. 
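
The two-test result can be checked numerically. The following is a minimal sketch, not part of the original derivation: the scores are hypothetical, the function names are illustrative, and variances use the divisor N − 1 to match the degrees of freedom in equation 13. It computes the interaction mean square directly from the sums of squares and compares it with the right-hand side of equation 14.

```python
def interaction_variance_two_tests(x, y):
    """Interaction mean square for N persons by 2 tests (left side of eq. 14)."""
    n = len(x)
    grand = (sum(x) + sum(y)) / (2 * n)
    # Center on the grand mean, since the text assumes a total mean of zero.
    x = [v - grand for v in x]
    y = [v - grand for v in y]
    mx, my = sum(x) / n, sum(y) / n
    a = [(xi + yi) / 2 for xi, yi in zip(x, y)]
    total = sum(v * v for v in x) + sum(v * v for v in y)
    persons = 2 * sum(v * v for v in a)
    tests = n * (mx * mx + my * my)
    return (total - persons - tests) / (n - 1)


def half_variances_minus_covariance(x, y):
    """Right side of eq. 14: (s_x^2 + s_y^2)/2 - r_xy s_x s_y, divisor N - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((v - mx) ** 2 for v in x) / (n - 1)
    var_y = sum((v - my) ** 2 for v in y) / (n - 1)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    return (var_x + var_y) / 2 - cov_xy


# Hypothetical scores for five persons on two forms of a test.
x = [12.0, 15.0, 9.0, 20.0, 14.0]
y = [11.0, 17.0, 10.0, 18.0, 15.0]
lhs = interaction_variance_two_tests(x, y)
rhs = half_variances_minus_covariance(x, y)
print(lhs, rhs)
```

The two values agree for any data; the further step to equation 16 additionally requires equal standard deviations on the two forms.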

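Let us extend the demonstration of the relationship between interaction 
and error variance to the case of K tests, where K may be any 

$$\begin{array}{ccccc|c}
x_{11} & x_{12} & x_{13} & \cdots & x_{1K} & a_1 \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2K} & a_2 \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3K} & a_3 \\
\vdots & \vdots & \vdots &        & \vdots & \vdots \\
x_{N1} & x_{N2} & x_{N3} & \cdots & x_{NK} & a_N \\
\hline
M_1 & M_2 & M_3 & \cdots & M_K & 0.
\end{array}$$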


Again let us assume that the grand mean is zero since this will simplify 
the expressions without loss of generality. Let us use i and j (varying 
from 1 to N) as the subscripts representing persons, and g and h (varying 
from 1 to K) as the subscripts representing tests. We may then 
write the total sum of squares as 

$$\sum_{i=1}^{N}\sum_{g=1}^{K} x_{ig}^2.$$

The sum of squares for tests would be 

$$N\sum_{g=1}^{K} M_g^2,$$

where 

$$M_g = \frac{\sum_{i=1}^{N} x_{ig}}{N}.$$

Similarly the sum of squares due to persons would be 

$$K\sum_{i=1}^{N} a_i^2,$$

where 

$$a_i = \frac{\sum_{g=1}^{K} x_{ig}}{K}.$$



The sum of squares due to interaction may be designated by $I^2$ and 
written as 

(17)   $I^2 = \sum_{i=1}^{N}\sum_{g=1}^{K} x_{ig}^2 - K\sum_{i=1}^{N} a_i^2 - N\sum_{g=1}^{K} M_g^2.$

Using the definition of $a_i$ given above, we can write 

(18)   $I^2 = \sum_{i=1}^{N}\sum_{g=1}^{K} x_{ig}^2 - \dfrac{1}{K}\sum_{i=1}^{N}\left[\sum_{g=1}^{K} x_{ig}\right]^2 - N\sum_{g=1}^{K} M_g^2.$

This equation can be rewritten as 

(19)   $I^2 = \sum_{g=1}^{K}\left[\sum_{i=1}^{N} x_{ig}^2\right] - \dfrac{1}{K}\sum_{g=1}^{K}\sum_{i=1}^{N} x_{ig}^2 - \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\sum_{i=1}^{N} x_{ig}x_{ih} - N\sum_{g=1}^{K} M_g^2.$

The last term may be written in two parts, and $\sum_{g \neq h}^{K^2-K} N M_g M_h / K$ can 
be added and subtracted, giving terms that constitute variances and 
covariances as follows: 

(20)   $I^2 = \sum_{g=1}^{K}\left(1 - \dfrac{1}{K}\right)\left[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\right] - \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\left[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\right] - \dfrac{1}{K}\sum_{g=1}^{K} N M_g^2 - \dfrac{1}{K}\sum_{g \neq h}^{K^2-K} N M_g M_h.$

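The last two terms can be combined into a squared term: 

(21)   $I^2 = \left(1 - \dfrac{1}{K}\right)\sum_{g=1}^{K}\left[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\right] - \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\left[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\right] - \dfrac{N}{K}\left[\sum_{g=1}^{K} M_g\right]^2.$

The last term can be omitted since $\Sigma M_g$ is zero. We thus have the 
final expression for sum of squares due to interaction, as 

(22)   $I^2 = \left(1 - \dfrac{1}{K}\right)\sum_{g=1}^{K}\left[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\right] - \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\left[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\right].$

Dividing by the appropriate degrees of freedom (K − 1)(N − 1), we 
obtain 

(23)   $\dfrac{I^2}{(K-1)(N-1)} = \dfrac{1}{K}\sum_{g=1}^{K}\left[\dfrac{\sum_{i=1}^{N} x_{ig}^2 - N M_g^2}{N-1}\right] - \dfrac{1}{K(K-1)}\sum_{g \neq h}^{K^2-K}\left[\dfrac{\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h}{N-1}\right],$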


which can be rewritten as an average variance and covariance. 

(24)   $\dfrac{I^2}{(K-1)(N-1)} = \dfrac{1}{K}\sum_{g=1}^{K} s_g^2 - \dfrac{1}{K^2-K}\sum_{g \neq h}^{K^2-K} r_{gh}s_gs_h.$

This gives the final expression for variance due to interaction: 

(25)   $\dfrac{I^2}{(K-1)(N-1)} = \overline{(s_g^2)} - \overline{r_{gh}s_gs_h}.$


That is, the variance attributable to interaction between persons and 
tests is the average test variance, minus the average intertest covariance. 
Since we are dealing with parallel tests we may assume that the variances 
are equal and that the intercorrelations are equal, giving 


(26)   $\dfrac{I^2}{(K-1)(N-1)} = s_g^2(1 - r_{gh}).$

For the general case of K parallel tests, the variance due to 
interaction between persons and tests is the square of the error 
of measurement. 
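
Equation 25 is an algebraic identity, so it can be verified on any persons-by-tests table. The sketch below is illustrative only: the score matrix is hypothetical, the function names are not from the text, and variances and covariances use the divisor N − 1.

```python
def covariance(u, v):
    """Covariance with divisor N - 1 (a variance when u and v are the same list)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def interaction_mean_square(scores):
    """I^2 / ((K - 1)(N - 1)) computed from the three sums of squares."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    z = [[v - grand for v in row] for row in scores]  # grand mean set to zero
    total = sum(v * v for row in z for v in row)
    persons = k * sum((sum(row) / k) ** 2 for row in z)
    tests = n * sum((sum(row[g] for row in z) / n) ** 2 for g in range(k))
    return (total - persons - tests) / ((k - 1) * (n - 1))


def mean_variance_minus_mean_covariance(scores):
    """Right side of eq. 25: average test variance minus average covariance."""
    k = len(scores[0])
    cols = [[row[g] for row in scores] for g in range(k)]
    mean_var = sum(covariance(c, c) for c in cols) / k
    mean_cov = sum(
        covariance(cols[g], cols[h])
        for g in range(k) for h in range(k) if g != h
    ) / (k * k - k)
    return mean_var - mean_cov


# Hypothetical scores: five persons (rows) on three tests (columns).
scores = [
    [12.0, 11.0, 13.0],
    [15.0, 17.0, 16.0],
    [9.0, 10.0, 8.0],
    [20.0, 18.0, 19.0],
    [14.0, 15.0, 15.0],
]
lhs = interaction_mean_square(scores)
rhs = mean_variance_minus_mean_covariance(scores)
print(lhs, rhs)
```

The agreement does not depend on the tests being parallel; parallelism is needed only for the step from equation 25 to equation 26.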

If the error of measurement is small as compared with the standard 
deviation of the test, the interaction variance is small (as compared with 
the true variance discussed in the next section), and the different tests 
are highly correlated. Correspondingly, if the interaction variance is 
large, as compared with variance due to persons, the error of measurement 
is large, relative to the standard deviation of the test, and the 
reliability of the test (the correlation between parallel forms) is low. 


4. Relation between true variance and variance due to persons 

In considering the relationship between test theory and analysis of 
variance, we have shown that interaction between persons and tests is 
identical with error of measurement. It is clear that if all the tests 
had the same mean, the variance due to tests would be zero. Thus the 
variance due to tests simply measures the extent to which the means of 
the different tests differ. If this difference is unduly large, the 
tests are not “parallel” in the sense of having equal means. 

Since in test theory, the “true variance” represents a variance between 
persons, we should expect to find some relationship between true 
variance and “variance due to persons.” Referring to the previous 
section, we find an expression for the sum of squares due to persons. 
Designating this by $P^2$, we may write 

(27)   $P^2 = K\sum_{i=1}^{N} a_i^2,$

where 

(28)   $a_i = \dfrac{\sum_{g=1}^{K} x_{ig}}{K}.$

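We may substitute equation 28 in equation 27, obtaining 

(29)   $P^2 = \dfrac{1}{K}\sum_{i=1}^{N}\left[\sum_{g=1}^{K} x_{ig}\right]^2.$

The square of a sum may also be written as a double summation, giving 

(30)   $P^2 = \dfrac{1}{K}\sum_{i=1}^{N}\sum_{g=1}^{K}\sum_{h=1}^{K} x_{ig}x_{ih}.$

Changing the order of summation, we obtain 

(31)   $P^2 = \dfrac{1}{K}\sum_{g=1}^{K}\sum_{h=1}^{K}\sum_{i=1}^{N} x_{ig}x_{ih}.$

Let us now explicitly separate the terms involving a sum of squares 
from those involving a cross product, obtaining 

(32)   $P^2 = \dfrac{1}{K}\sum_{g=1}^{K}\sum_{i=1}^{N} x_{ig}^2 + \dfrac{1}{K}\sum_{g=1}^{K}\sum_{h=1}^{K}\sum_{i=1}^{N} x_{ig}x_{ih} \quad (g \neq h).$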


We observe that, since all terms where g = h are excluded from the 
second expression, it contains only (K² − K) terms. In order to have 
the upper limit of summation indicate the number of terms, we shall 
write equation 32 as follows: 

(33)   $P^2 = \dfrac{1}{K}\sum_{g=1}^{K}\sum_{i=1}^{N} x_{ig}^2 + \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\sum_{i=1}^{N} x_{ig}x_{ih}.$

Referring to the preceding section, we note that we assumed the mean 
of all means to be zero, but did not specify that the mean of each test 
was zero. Therefore, in order to obtain deviations from the mean, it 
is necessary to rewrite equation 33 as 

(34)   $\dfrac{1}{K}\sum_{g=1}^{K}\left[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\right] + \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\left[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\right].$

This change, however, does not affect the value of $P^2$ since the terms 
added total to zero. We can see this by writing them explicitly: 

(35)   $\sum_{g=1}^{K} N M_g^2 + \sum_{g \neq h}^{K^2-K} N M_g M_h,$

and then, expressing equation 35 as the square of a sum, we write 

(36)   $N\left[\sum_{g=1}^{K} M_g\right]^2.$

Since the term in brackets is K times the grand mean, it equals zero, 
which shows that the value of equation 33 is equal to the value of 
equation 34. 
Hence we may write 


(37)   $P^2 = \dfrac{1}{K}\sum_{g=1}^{K}\left[\sum_{i=1}^{N} x_{ig}^2 - N M_g^2\right] + \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\left[\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h\right].$

Dividing now by the appropriate degrees of freedom (N − 1), we obtain 

(38)   $\dfrac{P^2}{N-1} = \dfrac{1}{K}\sum_{g=1}^{K}\left[\dfrac{\sum_{i=1}^{N} x_{ig}^2 - N M_g^2}{N-1}\right] + \dfrac{1}{K}\sum_{g \neq h}^{K^2-K}\left[\dfrac{\sum_{i=1}^{N} x_{ig}x_{ih} - N M_g M_h}{N-1}\right].$

This equation may be rewritten explicitly in terms of average variance 
and covariance, giving 

(39)   $\dfrac{P^2}{N-1} = \overline{(s_g^2)} + (K-1)\,\overline{r_{gh}s_gs_h}.$

Equation 39 gives the value of the variance due to persons. It is the 
average variance plus (K − 1) times the average covariance. 

We can see that, if we divide equation 39 by K, we obtain 

(40)   $\dfrac{P^2}{K(N-1)} = \dfrac{\overline{(s_g^2)}}{K} + \left(1 - \dfrac{1}{K}\right)\overline{r_{gh}s_gs_h},$

and that, if we let K approach infinity, we obtain 

(41)   $\lim_{K \to \infty}\dfrac{P^2}{K(N-1)} = \overline{r_{gh}s_gs_h},$

which is equivalent to true variance as discussed in Chapters 2 and 3. 

For K parallel tests, one Kth of the value of the variance due 
to persons approaches the true variance as K approaches infinity. 
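
Equations 39 and 40 can be illustrated in the same way. This is a sketch under stated assumptions: the score matrix and the function names are hypothetical, and the divisor N − 1 is used throughout. The first comparison checks equation 39 on data; the loop then shows how the right-hand side of equation 40 moves toward the average covariance as K grows, with the average variance and covariance held fixed.

```python
def covariance(u, v):
    """Covariance with divisor N - 1 (a variance when u and v are the same list)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def persons_mean_square(scores):
    """P^2 / (N - 1): sum of squares due to persons over its degrees of freedom."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    return k * sum((sum(row) / k - grand) ** 2 for row in scores) / (n - 1)


def mean_variance_and_covariance(scores):
    """Average test variance and average intertest covariance."""
    k = len(scores[0])
    cols = [[row[g] for row in scores] for g in range(k)]
    mean_var = sum(covariance(c, c) for c in cols) / k
    mean_cov = sum(
        covariance(cols[g], cols[h])
        for g in range(k) for h in range(k) if g != h
    ) / (k * k - k)
    return mean_var, mean_cov


def persons_variance_per_test(mean_var, mean_cov, k):
    """Right side of eq. 40 for K tests."""
    return mean_var / k + (1 - 1 / k) * mean_cov


# Hypothetical scores: five persons (rows) on three tests (columns).
scores = [
    [12.0, 11.0, 13.0],
    [15.0, 17.0, 16.0],
    [9.0, 10.0, 8.0],
    [20.0, 18.0, 19.0],
    [14.0, 15.0, 15.0],
]
k = len(scores[0])
mean_var, mean_cov = mean_variance_and_covariance(scores)
lhs = persons_mean_square(scores)
rhs = mean_var + (k - 1) * mean_cov  # eq. 39
print(lhs, rhs)

for big_k in (2, 10, 100, 10000):
    print(big_k, persons_variance_per_test(mean_var, mean_cov, big_k), mean_cov)
```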


5. Summary 

Several different interpretations of the error of measurement have 
been given. 

1. The error of measurement is the standard deviation of the differences 
between the observed score and the true score. 

2. If the difference between scores on two parallel tests is regarded as 
being made up of two equal and uncorrelated components, the 
standard deviation of the distribution of these components is the 
error of measurement. 

3. The error of measurement is identical with the error of estimate 
based on the regression of observed on true scores. 

4. The error of measurement is the square root of the variance due 
to interaction between persons and tests, provided one assumes a 
set of parallel tests. 

In addition to these interpretations of error variance, we have also 
seen that the true variance is the limit, as K approaches infinity, of 
1/Kth of the variance due to persons. 


Problems 

1. Show that, if $x_1 - x_2$ is regarded as divided into two equal and uncorrelated 
parts, the standard deviation of one of these parts is the error of measurement. 

2. Show that the error made in estimating observed score from true score is the 
error of measurement. 

3. Show that the standard deviation of the difference between observed and true 
score is the error of measurement. 

4. Derive the equation for predicting scores in $X_1$ from scores in $X_2$, where $X_1$ 
and $X_2$ are two different forms of a test whose reliability and validity are both known. 
Derive the equation for predicting the true score Y from $Y_1$, where $Y_1$ is a test whose 
reliability and validity are known. 

5. (a) Describe the correct experimental method for checking the applicability 
of the equation 

$$\sigma_e = \sigma_y\sqrt{1 - r_{xy}^2},$$

where $\sigma_e$ is the standard error of estimate made in estimating y from x, 
$\sigma_y$ is the standard deviation of the observed y distribution, and 
$r_{xy}$ is the correlation between x and y. 

(b) Would the method you suggest be expected to give an exact agreement, or 
only an approximate agreement? 

(c) What explanations would you offer for a failure to check the equation? 



6 

Effect of Doubling Test Length 
on Other Test Parameters 


1. Introduction 

We have now considered in some detail the parameters used to 
describe a test. They are the mean, the three variances (observed, true, 
and error), the reliability coefficient, or self-correlation, and the 
correlation with true score (or “index of reliability”). 

We have also considered the interrelationships of these parameters. 
The number of items in a test is another important parameter of a test. 
It is the one characteristic of a test which can be most readily controlled. 
If there is a good supply of items, it is relatively easy to decide to use 
30 or 100 or 200 items in a test. What is the best number of items to 
use? The decision depends in part upon limitations of testing time 
available, but the number of items should also be sufficiently great to 
insure for the test a sufficiently high reliability and a sufficiently low 
error of measurement. How many items are necessary to give a 
reliability of, let us say, .95? In order to answer such questions, it is 
necessary to know the relationship between test length and the other 
parameters. Let us turn to a consideration of the effect of the length of a test 
upon its reliability, error of measurement, and other parameters. 

2. Effect of doubling a test on the mean 

First let us consider merely doubling the length of the test, after 
which we shall take up the more general case of increasing the length 
any number of times. 

Let us consider the effect of doubling a test upon its mean and standard 
deviation. If we designate the original test by the subscript 1 and the 
added portion by the subscript 2, the “composite” score of the ith 
person ($X_{ic}$) may be written 

(1)    $X_{ic} = X_{i1} + X_{i2}.$

The average may be found by summing and dividing by N, obtaining 

(2)    $\dfrac{\sum_{i=1}^{N} X_{ic}}{N} = \dfrac{\sum_{i=1}^{N} X_{i1}}{N} + \dfrac{\sum_{i=1}^{N} X_{i2}}{N}.$

Since the mean equals the sum of scores divided by N, we have 

(3)    $M_c = M_1 + M_2.$

Since the two tests are parallel, the mean of test 2 will be equal to the 
mean of test 1, and we have 

(4)    $M_c = 2M_1.$


Doubling the length of a test doubles the mean, provided the 
original part and the added part are parallel tests. 


3. Effect of doubling a test on the variance of gross scores 

We shall next observe what happens to the variance of a test when 
its length is doubled. Again we begin with the gross score 

(5)    $X_c = X_1 + X_2.$

Since the mean of the combined tests equals the sum of the two part 
means, we may convert to deviation scores by writing 

(6)    $X_c - M_c = (X_1 - M_1) + (X_2 - M_2),$

which may be written 

(7)    $x_c = x_1 + x_2.$

Squaring both sides, summing, and dividing by N, we have 

(8)    $\dfrac{\Sigma x_c^2}{N} = \dfrac{\Sigma x_1^2}{N} + \dfrac{\Sigma x_2^2}{N} + \dfrac{2\,\Sigma x_1x_2}{N}.$

Expressing this in terms of variances and covariances, we have 

(9)    $s_c^2 = s_1^2 + s_2^2 + 2r_{12}s_1s_2.$

Since the tests are parallel, $s_1 = s_2$, and we may write 

(10)   $s_c^2 = 2s_1^2(1 + r_{12}).$

We may take the square root of both sides in order to obtain the 
standard deviation. This gives 

(11)   $s_c = s_1\sqrt{2(1 + r_{12})}.$

Doubling the length of a test increases its standard deviation 
as indicated in equation 11, provided that the original part 
and the added part are parallel tests. 
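
Equations 9 through 11 can be traced on a small example. The sketch below is an illustration with made-up numbers: the second half-test is given the same set of score values as the first in a different order, an artificial way to force equal means and variances so that equation 10 holds exactly. Variances use the divisor N, as in equation 8.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Variance with divisor N, matching equation 8."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def covariance(u, v):
    """Covariance with divisor N."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


# Hypothetical half-test scores; x2 reuses the values of x1 in another order,
# so the two parts have identical means and variances.
x1 = [3.0, 5.0, 8.0, 10.0, 14.0]
x2 = [5.0, 3.0, 10.0, 8.0, 14.0]
composite = [a + b for a, b in zip(x1, x2)]

var_1, var_2 = variance(x1), variance(x2)
r12 = covariance(x1, x2) / math.sqrt(var_1 * var_2)
var_c = variance(composite)

eq9 = var_1 + var_2 + 2 * r12 * math.sqrt(var_1 * var_2)
eq10 = 2 * var_1 * (1 + r12)
eq11 = math.sqrt(var_1) * math.sqrt(2 * (1 + r12))
print(var_c, eq9, eq10)
print(math.sqrt(var_c), eq11)
```

Equation 9 holds for any two parts; equations 10 and 11 rely on the equal variances built into this example.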


4. Effect of doubling a test on true variance 

Since the “true score” of a given person is the same on the original 
and on the new part of the test, his true score on the combined tests 
is double the original true score: 


(12)   $T_c = T_1 + T_2 = 2T_1.$

Since the mean true score is likewise doubled, we may also write the 
same equation in deviation score form as 

(13)   $t_c = 2t_1.$

Squaring, summing, and dividing by N gives 

(14)   $\dfrac{\Sigma t_c^2}{N} = \dfrac{4\,\Sigma t_1^2}{N},$

which may also be written 

(15)   $s_{t_c}^2 = 4s_{t_1}^2.$

Taking the square root of both sides gives 

(16)   $s_{t_c} = 2s_{t_1}.$

Doubling the length of the test doubles the true standard deviation, 
or quadruples the true variance, when the original part 
and the added part are parallel tests. 


5. Effect of doubling a test on error variance 
From equations 5 and 12, we may write 

(17)   $X_c - T_c = (X_1 - T_1) + (X_2 - T_2).$

This expression is clearly the “error score” for each of the part tests 
and the composite; therefore we may write 

(18)   $e_c = e_1 + e_2.$

Squaring both sides, summing, and dividing by N gives 

(19)   $\dfrac{\Sigma e_c^2}{N} = \dfrac{\Sigma e_1^2}{N} + \dfrac{\Sigma e_2^2}{N} + \dfrac{2\,\Sigma e_1e_2}{N},$

which may also be written 

(20)   $s_{e_c}^2 = s_{e_1}^2 + s_{e_2}^2 + 2r_{e_1e_2}s_{e_1}s_{e_2}.$

Since by the definition of random error the correlational term vanishes, 
and the error of measurement in 1 is equal to that in 2 because the tests 
are parallel, we may write 

(21)   $s_{e_c}^2 = 2s_{e_1}^2.$

Taking the square root of both sides gives 

(22)   $s_{e_c} = s_{e_1}\sqrt{2}.$

When a test is doubled in length the error variance is doubled 
or the error of measurement is multiplied by the square root 
of two, if the original part and the added part are parallel 
tests. 

6. Relation of true, error, and observed variance 

Let us check to see if equation 18 of Chapter 2 (the true variance 
plus error variance equals the observed variance) holds for the 
double-length test. Equations 10, 15, and 21 give the value of the observed, 
true, and error variance for the double-length test. Let us set equation 
10 equal to the sum of equations 15 and 21 and see if this gives an 
identity. We have 

(23)   $2s_1^2(1 + r_{12}) = 4s_{t_1}^2 + 2s_{e_1}^2.$

As has been shown previously, in equation 38 of Chapter 2, $r_{12}s_1^2 = s_t^2$, 
and $s_e^2 = s_1^2(1 - r)$; see equation 41, Chapter 2. Substituting these 
values in equation 23, we have 

(24)   $2s_1^2(1 + r) = 4rs_1^2 + 2s_1^2(1 - r).$

Since this expression is an identity, the relationship previously 
established for the single-length test still holds for the double-length test, so 
that the equations developed are not inconsistent among themselves. 

Observed variance is equal to true variance plus error variance 
for the double-length test. 
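
The identity in equation 24 can also be spot-checked numerically. The following is a small sketch with illustrative names; the unit-test variance 28.09 (a standard deviation of 5.3) and the grid of reliabilities are arbitrary choices made for the example.

```python
def double_length_variances(s1_squared, r):
    """Observed, true, and error variance of the doubled test.

    s1_squared is the observed variance of the unit test and r its reliability;
    the three lines follow equations 10, 15, and 21, using s_t^2 = r * s_1^2
    and s_e^2 = s_1^2 * (1 - r).
    """
    observed = 2 * s1_squared * (1 + r)
    true = 4 * r * s1_squared
    error = 2 * s1_squared * (1 - r)
    return observed, true, error


for r in (0.0, 0.25, 0.5, 0.85, 1.0):
    observed, true, error = double_length_variances(28.09, r)
    print(r, observed, true + error)
```

For every reliability the observed variance equals the sum of the true and error variances, as equation 24 requires.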

7. Effect of doubling the length of a test on its reliability 
(Spearman-Brown formula for double length) 

Since the reliability of a test means its correlation with a parallel 
form, we shall assume four unit parallel tests designated by the 
subscripts 1, 2, 3, and 4, and then determine the reliability of a test of 
double length by obtaining the correlation of 1 + 2 and 3 + 4. 
Substituting in the deviation score formula for correlation, we may write 

(25)   $r_{(1+2)(3+4)} = \dfrac{\Sigma (x_1 + x_2)(x_3 + x_4)}{\sqrt{\Sigma (x_1 + x_2)^2}\,\sqrt{\Sigma (x_3 + x_4)^2}}.$

This equation can be expanded into 

(26)   $r_{(1+2)(3+4)} = \dfrac{\Sigma x_1x_3 + \Sigma x_1x_4 + \Sigma x_2x_3 + \Sigma x_2x_4}{\sqrt{\Sigma x_1^2 + \Sigma x_2^2 + 2\,\Sigma x_1x_2}\,\sqrt{\Sigma x_3^2 + \Sigma x_4^2 + 2\,\Sigma x_3x_4}}.$

Dividing the numerator and the denominator each by N, we can write 
the result in terms of variances and covariances. We may also simplify 
the denominator by noting that, since we are dealing with parallel 
tests, the variance of 1 + 2 will equal that of 3 + 4. Making these 
changes gives 

(27)   $r_{(1+2)(3+4)} = \dfrac{r_{13}s_1s_3 + r_{23}s_2s_3 + r_{14}s_1s_4 + r_{24}s_2s_4}{s_1^2 + s_2^2 + 2r_{12}s_1s_2}.$

We may write this result in terms of average variance and average 
covariance: 

(28)   $r_{(1+2)(3+4)} = \dfrac{2\,\overline{(r_{gh}s_gs_h)}}{\overline{s_g^2} + \overline{r_{gh}s_gs_h}}.$

Since we have parallel tests, the variance of the first test and the 
covariance of the first two can be used in place of the average, giving 

(29)   $r_{(1+2)(3+4)} = \dfrac{2r_{12}s_1s_2}{s_1^2 + r_{12}s_1s_2}.$

Since the standard deviations are equal we can divide numerator and 
denominator by $s_1^2$. Let $r_c$ stand for $r_{(1+2)(3+4)}$, the reliability of the 
composite test. 

(30)   $r_c = \dfrac{2r_{12}}{1 + r_{12}}$   (Spearman-Brown formula for double length). 

If the length of a test is doubled by adding a parallel form, 
the reliability is increased as indicated by equation 30. 

This is the conventional Spearman-Brown formula for double length. 
It gives an estimate of the reliability of a test if the test is doubled in 
length. Equation 30 is probably more commonly used than any of 
the other equations in this chapter. Whenever the reliability of a test 
is computed by correlating odd with even items, or some other split-half 
method, the correlation between the halves is substituted (as $r_{12}$) in 
equation 30 to give the reliability coefficient for the total test. In this 
manner we have an estimate of the correlation of the test with a parallel 
form. Equation 30 is, of course, not used when reliability is determined 
by correlating two parallel forms of a test. 
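
The split-half use of equation 30 just described might be sketched as follows. The item responses are hypothetical (six persons, six items scored 0 or 1), and the function names are illustrative rather than taken from the text. Odd-numbered and even-numbered items are summed separately, the two half scores are correlated, and the correlation is stepped up with equation 30.

```python
import math


def pearson(u, v):
    """Product-moment correlation in deviation-score form."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def spearman_brown_double(r_half):
    """Equation 30: reliability of a test made of two parallel halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then apply equation 30."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return r_half, spearman_brown_double(r_half)


# Hypothetical right/wrong responses: one row per person, one column per item.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
]
r_half, reliability = split_half_reliability(responses)
print(round(r_half, 3), round(reliability, 3))
```

The stepped-up value is an estimate of the correlation of the whole test with a parallel form only to the extent that the two halves are themselves parallel.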

By checking back over the derivation, we note that nothing has been 
assumed except that the tests in question are parallel. More explicitly, 
equality of standard deviations and of intercorrelations has been 
assumed, but nothing else. In other words, it has been assumed that 
$s_1 = s_2 = s_3 = s_4$ and that $r_{12} = r_{13} = r_{14} = r_{23} = r_{24} = r_{34}$. If these 
two sets of equalities hold, then the Spearman-Brown formula is simply 
a computational short cut for figuring the reliability of the double-length 
test. By the device of using average variance and average covariance, 
as in equation 28, we see that, if the variance $s_1^2$ and the covariance 
$r_{12}s_1s_2$ differ only slightly from the average variance and covariance 
in equation 28, then equation 30 gives a good approximation to the 
reliability that will be found for the doubled test. 

The application of formulas for a double-length test may be 
illustrated by the following example. A 50-item test has a mean of 42.0, a 
standard deviation of 5.3, and a reliability of .85. By the use of equation 
38, Chapter 2, or equation 19, Chapter 3, the true variance of this test 
is 23.88; and from equation 41, Chapter 2, or equation 37, Chapter 3, 
the square of the error of measurement is 4.21. If this test is increased 
to 100 items by adding another 50 items that are parallel to the first set 
of 50, we should find the following statistics for the 100-item test: 

    Mean                                     84.0    (from equation 4); 
    Standard deviation                       10.2    (from equation 11); 
    Variance of the errors of measurement     8.42   (from equation 21); 
    True variance                            95.52   (from equation 15); 
    Reliability                                .92   (from equation 30). 
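
The arithmetic of this example can be reproduced with a short function. This is a sketch only; the function and key names are illustrative. Because it carries unrounded intermediate values, its error variance and true variance differ slightly in the second decimal place from the 8.42 and 95.52 above, which are 2 × 4.21 and 4 × 23.88 computed from the already rounded unit-test values.

```python
import math


def double_length_statistics(mean, sd, reliability):
    """Statistics of a test doubled in length by adding a parallel part."""
    variance = sd ** 2
    true_var = reliability * variance          # s_t^2 = r * s^2
    error_var = (1 - reliability) * variance   # s_e^2 = s^2 * (1 - r)
    return {
        "mean": 2 * mean,                                    # equation 4
        "sd": sd * math.sqrt(2 * (1 + reliability)),         # equation 11
        "error_variance": 2 * error_var,                     # equation 21
        "true_variance": 4 * true_var,                       # equation 15
        "reliability": 2 * reliability / (1 + reliability),  # equation 30
    }


stats = double_length_statistics(mean=42.0, sd=5.3, reliability=0.85)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```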


The important thing to emphasize here is that to the extent to which 
we are able to construct parallel tests we do not have a “prophecy” 
formula, but simply a computing formula. If the Spearman-Brown 
formula fails to “work” or predicts “inaccurately” in any case, this simply 
means that the correlation used (for example, the correlation $r_{12}$) was 
larger or smaller than the average of the intercorrelations $r_{gh}$. There 
have been several “empirical studies” of the accuracy of the 
Spearman-Brown formula. Strictly speaking, it needs no verification, and cannot 
be verified. It is possible, however, to intercorrelate the four halves of 
the two parallel tests and to see if the tests are really parallel in the sense 
that $r_{12} = r_{13} = r_{14} = \cdots = r_{34}$, and in the sense that the four variances 
are equal. If so, the Spearman-Brown formula cannot fail to “work” 
except through arithmetical error; and, if these assumptions are not 
verified, the Spearman-Brown formula does not apply, since the tests 
are not parallel. It is possible to investigate empirically the extent to 
which the amount of departure from “strict parallelness” that is usually 
found affects the applicability of the Spearman-Brown formula. 
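
The point that the formula is a computing device can be made concrete at the level of the covariance matrix of the four parts. The sketch below uses assumed values: the standard deviation of 5.0 and the correlations .6 and .8 are chosen only for illustration, and the function names are not from the text. The correlation of (1 + 2) with (3 + 4) is computed directly, as in equation 27 but without assuming that the two composite variances are equal, and is compared with equation 30 applied to $r_{12}$.

```python
import math


def covariance_matrix(sds, corr):
    """Covariance matrix of four part tests from their s.d.'s and correlations."""
    return [[corr[g][h] * sds[g] * sds[h] for h in range(4)] for g in range(4)]


def composite_correlation(cov):
    """Correlation of parts (1 + 2) with parts (3 + 4)."""
    cross = cov[0][2] + cov[0][3] + cov[1][2] + cov[1][3]
    var_a = cov[0][0] + cov[1][1] + 2 * cov[0][1]
    var_b = cov[2][2] + cov[3][3] + 2 * cov[2][3]
    return cross / math.sqrt(var_a * var_b)


def spearman_brown_double(r12):
    """Equation 30."""
    return 2 * r12 / (1 + r12)


# Strictly parallel parts: equal standard deviations, every intercorrelation .6.
equal_corr = [[1.0 if g == h else 0.6 for h in range(4)] for g in range(4)]
parallel = covariance_matrix([5.0] * 4, equal_corr)
exact = composite_correlation(parallel)
stepped_up = spearman_brown_double(0.6)
print(exact, stepped_up)

# Parts that are not parallel: r12 is .8 while the other correlations stay at .6.
unequal_corr = [row[:] for row in equal_corr]
unequal_corr[0][1] = unequal_corr[1][0] = 0.8
not_parallel = covariance_matrix([5.0] * 4, unequal_corr)
print(composite_correlation(not_parallel), spearman_brown_double(0.8))
```

In the first case the two numbers coincide; in the second the formula, fed a correlation higher than the others, gives a higher value than the directly computed one.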

The foregoing remarks apply to a test that is primarily a power test. 
If a test is a pure speed test, the equations showing the effect of test 
length will apply only if the test is administered on a work limit basis 
(each person finishes the test and his time is recorded), and no serious 
warm-up or fatigue effects result from doubling the number of items and 
allowing each person whatever time is needed to just complete the 
lengthened test. For the more usual group speed test we have a 
time-limit method. In this method a fixed time is allowed, and the number 
of items completed is recorded. Ideally, there should be so many items 
that only the fastest person would complete the test in the time allowed. 
Then doubling the “test length” would mean doubling the number of 
items and allowing double the original time. The formulas for the effect 
of doubling test length would then apply, if there were no serious 
warm-up or fatigue effects during the second half of the longer time 
limit. The only general rule that can be given is to point out that the 
added portion of the test must be parallel to the old portion; and that 
the criterion of whether or not the two parts are parallel is the equality 
of means, variances, and errors of measurement for the two parts. To 
the extent to which these equalities do not hold, the formulas for the 
effect of doubling the length of the test will not apply, and cannot be 
expected to apply. A statistical criterion for parallel tests is given in 
Chapter 14. 

8. Experimental work on the Spearman-Brown formula 

It is interesting to note that the formula for the reliability of a 
double-length test (equation 30) and its generalization to K parallel tests, 
equation 10 of Chapter 8, were first derived in 1910 in the British 
Journal of Psychology, Volume 3. Spearman presented it on page 290, 
and William Brown presented it in the succeeding article (see page 299 
of the same volume). It is therefore referred to as the Spearman-Brown 
formula. 

The earlier articles on the Spearman-Brown formula were sharply 
critical; see Holzinger (1923b) and Crum (1923). Holzinger 
concluded that if we lengthened the test beyond five times its original 
length, the Spearman-Brown formula gave an overestimate of the 
reliability. No mention was made of the decreasing likelihood that the 
first correlation would be equal to the average of all correlations or the 
decreasing likelihood that the first variance would equal the average of 
all variances. 

Subsequent studies have led to the conclusion that the results obtained 
from the Spearman-Brown formula are reasonably accurate; see Kelley 
(1924), which is a reply to Crum's criticism, Kelley (1925), Holzinger 
and Clayton (1925), Ruch, Ackerson, and Jackson (1926), and Wood 
(1926c). Thurstone (1931a) has reviewed and summarized much 
of this material on the empirical verification of the Spearman-Brown 
formula. 

Slocombe (1927a, 1927b) reviewed several studies. He pointed out 
that it was assumed that the coefficient substituted in the formula was 
“representative” of the group so that this coefficient must be selected 
with care. 

Dunlap (1933) in discussing the problem of test reliability suggested 
that a tetrad technique should be used to determine whether or not 
split fourths of a test are measuring the same thing. It should be noted 
that a different assumption is made in the actual derivation of the 
Spearman-Brown formula. 

It has also been suggested that the Spearman-Brown formula applies 
to rating scales and judgments and that it may be used in predicting the 
reliability to be expected by increasing the number of judges or raters; 
see Gordon (1924), Furfey (1926), Remmers, Shock, and Kelly (1927), 
and Remmers (1931). In general, this suggestion has been found to be 
correct. By considering the only assumption made in the development 
of the Spearman-Brown formula, we see that, if the variance and 
reliability for the first rater are equal to the average variance and 
intercorrelation for all raters used, the Spearman-Brown formula must give 
the correct result. If the formula does not give the correct result, it is 
because the variances from rater to rater or the covariances for various 
pairs of raters differ markedly. In other words, the problem here is not 
“Does the Spearman-Brown formula work?” but is “Do the ratings 
from the different raters satisfy the criterion for parallel tests?” 

Lanier (1927) interpreted the work of Gordon (1924) and Kelley 
(1925) to mean that correlations increased as the number of cases 
increased. He showed that this was not true. Thurstone (1928a) points 
out clearly the difference between number of judges making a rating 
and number of cases in a correlation scatter plot. 

It has also been suggested that increasing the number of choices in a 
scale might increase the reliability according to the Spearman-Brown 
formula (see Remmers and Ewart, 1941); and that increasing the 
number of alternatives in a multiple-choice test will increase reliability 
in the same way (Remmers, Karslake, and Gage, 1940, and Denney and 
Remmers, 1940). 

Denney and Remmers reported that random elimination of incorrect 
alternatives from a five-alternative multiple-choice test resulted in a 
reduction of reliability that agreed with the Spearman-Brown formula. 
However, we see that there is no reason to expect the formula to work 
in this case, since it is meaningless to think of equality of variances or 
covariances for alternatives on a multiple-choice test. The logic relating 
number of alternatives to test reliability has been given by Carroll 
(1945) (see formula 30, page 11). An empirical study of this formula is 
being undertaken by Mrs. Plumlee, assistant director of Test Development 
for the Educational Testing Service. 

Nomographs or tables for the Spearman-Brown formula are given by 
Arnold and Dunlap (1936), Cureton and Dunlap (1930), Dunlap and 
Kurtz (1932), and Edgerton and Toops (1928b). 

In general, therefore, the work on the Spearman-Brown formula 
shows that even when relatively little effort is made to obtain parallel 
tests (or ratings) the formula gives reasonably good results. From the 
derivation it is clear that, if we are dealing with tests or ratings known 
to be parallel, the formula must give correct results. 

9. Summary 

For the double-length test it has been shown that: 


(4)    $M_c = 2M_1,$

(11)   $s_c = s_1\sqrt{2(1 + r_{12})},$

(16)   $s_{t_c} = 2s_{t_1},$

(22)   $s_{e_c} = s_{e_1}\sqrt{2},$

(30)   $r_c = \dfrac{2r_{12}}{1 + r_{12}}.$

For these equations the subscript c is used to denote the value for the 
composite or double-length test, the subscript 1 designates the mean or 
standard deviation of the unit-length test, and $r_{12}$ is the reliability of 
the unit-length test. 

Problems 

1. Prove that the correlation between true and observed score for a test of double 
length is $\sqrt{2r/(1 + r)}$, where r is the reliability of the original test. List the 
assumptions used in making this derivation. 

2. Prove that each of the other basic relations derived in Chapter 2 (or Chapter 3) 
holds for the augmented test. 

3. 

    Test   Number of items   Mean    Standard deviation   Reliability 
    A      250               180.1   30.0                 .90 
    B       30                20.0    4.5                 .72 
    C      100                69.3   12.4                 .87 
    D      200                83.7   22.8                 .91 
    E       50                37.4    7.4                 .83 

Estimate the reliability of each of the foregoing tests if it is doubled in length. 

4. Read the last section of the article by Lanier (1927) and comment on this 
application of the Spearman-Brown formula. Refer also to Thurstone (1928a). 

Effect of Test Length 
on Mean and Variance (General Case) 


1. Effect of test length on mean 

We shall now extend the discussion to consider the general case, the 
effect of test length on mean, variance (true, error and observed), and 
on reliability and index of reliability. 

If we designate the composite score of the ith person by $X_{ic}$, we may 
write 

(1)    $X_{ic} = X_{i1} + X_{i2} + \cdots + X_{iK}.$

Summing and dividing by N to obtain the mean, we have 

(2)    $\dfrac{\Sigma X_{ic}}{N} = \dfrac{\Sigma X_{i1}}{N} + \dfrac{\Sigma X_{i2}}{N} + \cdots + \dfrac{\Sigma X_{iK}}{N}.$

Substituting the mean for the sum of scores divided by N, we have 

(3)    $M_c = M_1 + M_2 + \cdots + M_K.$

Since all the tests are parallel, the mean of each of the component tests 
will be equal. $M_1$ can be substituted for each of the means, giving 

(4)    $M_c = KM_1.$

Increasing the length of a test K times multiplies the mean 
by K, provided that each of the new parts is parallel to the 
original. 

2. Effect of test length on variance of gross scores 
As before, we can begin with the expression for the composite gross 
score, 

(5)    $X_c = X_1 + X_2 + \cdots + X_K.$

Since the mean of the combined tests equals the sum of the part means 
(see equation 3), we may convert to deviation scores by writing 

(6)    $X_c - M_c = (X_1 - M_1) + (X_2 - M_2) + \cdots + (X_K - M_K).$

Using lower case x for a deviation score, we may write 

(7)    $x_c = x_1 + x_2 + \cdots + x_K.$

In order to obtain the standard deviation, we square both sides, sum, 
and divide by N, obtaining 

(8)    $\dfrac{\Sigma x_c^2}{N} = \dfrac{\Sigma (x_1 + x_2 + \cdots + x_K)^2}{N}.$

If we expand the numerator of the right-hand side of the equation, it 
will equal the sum of all the terms in the following matrix. 

$$\begin{array}{ccccc}
\Sigma x_1^2 & \Sigma x_1x_2 & \Sigma x_1x_3 & \cdots & \Sigma x_1x_K \\
\Sigma x_1x_2 & \Sigma x_2^2 & \Sigma x_2x_3 & \cdots & \Sigma x_2x_K \\
\Sigma x_1x_3 & \Sigma x_2x_3 & \Sigma x_3^2 & \cdots & \Sigma x_3x_K \\
\vdots & \vdots & \vdots & & \vdots \\
\Sigma x_1x_K & \Sigma x_2x_K & \Sigma x_3x_K & \cdots & \Sigma x_K^2.
\end{array}$$

Dividing through by N, we have the sum of all the terms in a 
variance-covariance matrix that can be written 

$$\begin{array}{ccccc}
s_1^2 & r_{12}s_1s_2 & r_{13}s_1s_3 & \cdots & r_{1K}s_1s_K \\
r_{12}s_1s_2 & s_2^2 & r_{23}s_2s_3 & \cdots & r_{2K}s_2s_K \\
r_{13}s_1s_3 & r_{23}s_2s_3 & s_3^2 & \cdots & r_{3K}s_3s_K \\
\vdots & \vdots & \vdots & & \vdots \\
r_{1K}s_1s_K & r_{2K}s_2s_K & r_{3K}s_3s_K & \cdots & s_K^2.
\end{array}$$

The sum of all these terms is the variance of the composite test composed 
of the sum of all the tests from 1 to K. 

(9)    $s_c^2 = \sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K}\sum_{h=1}^{K} r_{gh}s_gs_h \quad (g \neq h).$

The variance of the composite test may be expressed in terms of the 
average variance and the average covariance as follows: 


(10) $s_c^2 = K\overline{(s_g^2)} + K(K - 1)\overline{(r_{gh} s_g s_h)}.$

If the tests are parallel, the average $s_g^2$ may be replaced by $s_1^2$ and the
average covariance by $r_{12} s_1 s_2$, which equals $r_{12} s_1^2$ since $s_2 = s_1$.
If we factor out the $s_1^2 K$, we have

(11) $s_c^2 = s_1^2 K[1 + (K - 1) r_{12}].$

Lengthening a test K times increases the variance, as indicated
in equation 11.

Taking the square root of both sides and writing $r_{11}$ for the reliability
coefficient, we have

(12) $s_c = s_1 \sqrt{K + K(K - 1) r_{11}},$

where $s_1$ is the standard deviation of the unit length test,
$r_{11}$ is the reliability of the unit length test,
K is the ratio of the number of items in the new test to the number
in the unit length test, and
$s_c$ is the standard deviation of the lengthened test.

Multiplying the length of a test by K increases the standard
deviation, as indicated in equation 12.
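
As an illustration only, the following Python sketch checks equations 11 and 12 numerically. The unit-test values ($s_1 = 4.0$, $r_{11} = .6$, K = 3) and the function names are assumptions chosen for the example; they are not data from the text. The variance of the composite is obtained in two ways: by summing every entry of the variance-covariance matrix for K parallel parts (equation 9), and from the closed form of equation 11.

```python
import math

def composite_variance(cov):
    """Equation 9: the variance of a sum of part scores is the sum of
    every entry in the variance-covariance matrix of the parts."""
    return sum(sum(row) for row in cov)

def lengthened_sd(s1, r11, K):
    """Equation 12: standard deviation of a test lengthened K times."""
    return s1 * math.sqrt(K + K * (K - 1) * r11)

# Hypothetical unit test (values assumed for illustration).
s1, r11, K = 4.0, 0.6, 3

# For parallel parts every variance is s1^2 and every covariance r11 * s1^2.
cov = [[s1**2 if g == h else r11 * s1**2 for h in range(K)] for g in range(K)]

var_from_matrix = composite_variance(cov)
var_from_eq11 = s1**2 * K * (1 + (K - 1) * r11)
sd_from_eq12 = lengthened_sd(s1, r11, K)

print(var_from_matrix, var_from_eq11, sd_from_eq12)
```

The two variances agree, and the square of the equation 12 value reproduces them, which is all the sketch is meant to show.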


3. Effect of test length on true variance

Since the “true score” of a given person is the same on each of the
part tests, we may write the true score on the composite test as

(13) $T_c = K T_1.$

Since the mean true score is likewise multiplied by K, we may write
this same equation in deviation form,

(14) $t_c = K t_1.$

Squaring, summing, and dividing by N, we obtain

(15) $\frac{\sum_{i=1}^{N} t_{ic}^2}{N} = \frac{K^2 \sum_{i=1}^{N} t_{i1}^2}{N},$

which is equivalent to

(16) $s_{t_c}^2 = K^2 s_{t_1}^2.$

Taking the square root of both sides, we have

(17) $s_{t_c} = K s_{t_1}.$

Multiplying the length of a test by K multiplies the true 
standard deviation by K.

4. Effect of test length on error variance
From the equations developed in the preceding sections (see equations 
5 and 13), we may write 

(18) $X_c - T_c = (X_1 - T_1) + (X_2 - T_2) + \cdots + (X_K - T_K).$

We may use e to represent the error score and write

(19) $e_c = e_1 + e_2 + e_3 + \cdots + e_K.$

Squaring both sides, summing, and dividing by N, we obtain

(20) $\frac{\sum e_c^2}{N} = \frac{\sum (e_1 + e_2 + \cdots + e_K)^2}{N}.$

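This expression may be set up as the sum of all the terms in the following
matrix:

$$\begin{array}{lllll}
\sum e_1^2 & \sum e_1 e_2 & \sum e_1 e_3 & \cdots & \sum e_1 e_K \\
\sum e_1 e_2 & \sum e_2^2 & \sum e_2 e_3 & \cdots & \sum e_2 e_K \\
\sum e_1 e_3 & \sum e_2 e_3 & \sum e_3^2 & \cdots & \sum e_3 e_K \\
\vdots & \vdots & \vdots & & \vdots \\
\sum e_1 e_K & \sum e_2 e_K & \sum e_3 e_K & \cdots & \sum e_K^2.
\end{array}$$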


Dividing by N, we have

(21) $s_{e_c}^2 = s_{e_1}^2 + s_{e_2}^2 + \cdots + s_{e_K}^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{e_g e_h} s_{e_g} s_{e_h} \quad (g \neq h).$

Since, by the definition of random error, the correlational terms vanish,
we can write

(22) $s_{e_c}^2 = \sum_{g=1}^{K} s_{e_g}^2.$

This may also be written in terms of the average error variance as
follows:

(23) $s_{e_c}^2 = K\overline{(s_{e_g}^2)}.$

If we assume that the error variance of the first test is equal to the
average error variance, we may write

(24) $s_{e_c}^2 = K s_{e_1}^2.$

Taking the square root of both sides gives

(25) $s_{e_c} = s_{e_1} \sqrt{K}.$

Multiplying the length of a test by K multiplies the error variance
by K or the error of measurement by the square root of K.

5. Summary 

For the general case of lengthening a test to K times its original 
length, the effect on the mean and the different variances has been 
shown. We have 


(4) $M_c = K M_1,$

(12) $s_c = s_1 \sqrt{K + K(K - 1) r_{11}},$

(17) $s_{t_c} = K s_{t_1},$

(25) $s_{e_c} = s_{e_1} \sqrt{K}.$


As in Chapter 6, the subscript c is used to designate the composite 
score. In this case, however, it is the composite score formed by adding 
scores on K parallel forms. The subscript 1 is used to designate a mean 
or standard deviation of the original unit test, and $r_{11}$ is the reliability
of the original unit test. 
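
The four summary equations can be collected into one small function. The sketch below is only illustrative: the function name and the sample values are hypothetical, and it assumes the usual relations for the unit test, $s_{t_1} = s_1 \sqrt{r_{11}}$ and $s_{e_1} = s_1 \sqrt{1 - r_{11}}$, to obtain the unit true and error standard deviations from $s_1$ and $r_{11}$.

```python
import math

def lengthen(mean, sd, r11, K):
    """Apply equations 4, 12, 17, and 25 to a unit test lengthened K times.

    Assumes s_t1 = sd * sqrt(r11) and s_e1 = sd * sqrt(1 - r11)
    for the unit test.
    """
    true_sd_unit = sd * math.sqrt(r11)
    error_sd_unit = sd * math.sqrt(1 - r11)
    return {
        "mean": K * mean,                                   # equation 4
        "sd": sd * math.sqrt(K + K * (K - 1) * r11),        # equation 12
        "true_sd": K * true_sd_unit,                        # equation 17
        "error_sd": error_sd_unit * math.sqrt(K),           # equation 25
    }

# Hypothetical unit test: mean 20, standard deviation 5, reliability .80,
# doubled in length.
doubled = lengthen(20.0, 5.0, 0.80, 2)
print(doubled)
```

Squaring the three standard deviations in the result shows that the observed variance of the lengthened test is still the sum of its true and error variances.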


Problems 

1. Prove that the observed variance of an augmented test is equal to the sum of its
true and error variance.

2. Prove that the other basic relationships established in Chapter 2 and Chapter 3
still hold for the test augmented K times.

3.

Test   Mean   Standard    Number     Reliability   Number of
              Deviation   of Items                 Subjects
A      73.2   12.7        120        .92           300
B      17.3    3.8         25        .86           250
C      21.3    7.1         50        .80           430
D      29.3    7.9         75        .84           150
E      56.5   13.7        100        .89           200

(a) Estimate the variance of test A if it is increased to 240 items.

(b) Estimate the true variance of test B if 75 items are added.

(c) What is the error of measurement for test C?

(d) What will the error of measurement be if test C is lengthened to 150 items?

(e) How many items would need to be added to test D to double the true variance?

(f) How many items would need to be added to test E to double the error of
measurement?

(g) You would like to increase the standard deviation of test B to 7.6. How many
items would it be necessary to add to the test?


Effect of Test Length
on Reliability (General Case)


1. Introduction 

The equation for the relationship between test length and test reliability
will be developed from the usual formula for the correlation
between any two sums. No assumptions will be used in the derivation 
until the last step. There it will be assumed that the variance of the 
unit test can be taken as a fair approximation to the average variance 
of all the unit parallel forms, and the reliability of the unit test times 
its variance can be taken as a reasonable approximation to the average 
covariance among all the unit parallel forms. 

2. The correlation between any two sums 

First let us write the formula for the correlation between any two 
sums. One series will be designated by the subscripts 1, 2, ..., K, and
the other by the subscripts I, II, ..., L. From the usual formula for
correlation, we have then

(1) $r_{(x_1 + \cdots + x_K)(x_I + \cdots + x_L)} = \frac{\sum (x_1 + x_2 + \cdots + x_K)(x_I + x_{II} + \cdots + x_L)}{\sqrt{\sum (x_1 + x_2 + \cdots + x_K)^2} \, \sqrt{\sum (x_I + x_{II} + \cdots + x_L)^2}}.$

The terms involved in the expansion of the numerator can be systematically
set down in the following rectangular matrix:

$$\begin{array}{lllll}
\sum x_1 x_I & \sum x_1 x_{II} & \sum x_1 x_{III} & \cdots & \sum x_1 x_L \\
\sum x_2 x_I & \sum x_2 x_{II} & \sum x_2 x_{III} & \cdots & \sum x_2 x_L \\
\sum x_3 x_I & \sum x_3 x_{II} & \sum x_3 x_{III} & \cdots & \sum x_3 x_L \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_K x_I & \sum x_K x_{II} & \sum x_K x_{III} & \cdots & \sum x_K x_L.
\end{array}$$

First, $x_1$ is multiplied by $x_I, x_{II}, \ldots, x_L$; the same is done for $x_2$, giving
the second row; and so on to $x_K$. The sum of all these LK terms is
equal to

$\sum (x_1 + x_2 + \cdots + x_K)(x_I + x_{II} + \cdots + x_L).$

Each of the terms in this matrix may also be written as a covariance, 
giving 

$$\begin{array}{llll}
N r_{1,I} s_1 s_I & N r_{1,II} s_1 s_{II} & \cdots & N r_{1,L} s_1 s_L \\
N r_{2,I} s_2 s_I & N r_{2,II} s_2 s_{II} & \cdots & N r_{2,L} s_2 s_L \\
\vdots & \vdots & & \vdots \\
N r_{K,I} s_K s_I & N r_{K,II} s_K s_{II} & \cdots & N r_{K,L} s_K s_L.
\end{array}$$

Using g and G as the general subscripts designating tests, where g varies
from 1 to K and G varies from I to L, we can write the sum of all the
terms in the foregoing matrix as follows:

$\sum_{g=1}^{K} \sum_{G=I}^{L} N r_{gG} s_g s_G.$

Since N is a constant for all terms, we may take it outside, writing
the following equation for the numerator term of equation 1.

(2) $\sum (x_1 + x_2 + \cdots + x_K)(x_I + x_{II} + \cdots + x_L) = N \sum_{g=1}^{K} \sum_{G=I}^{L} r_{gG} s_g s_G.$

We also follow the same procedure for the denominator. The terms 
in the expansion of 

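$\sum (x_1 + x_2 + \cdots + x_K)^2$

may be set down in a square matrix as follows:

$$\begin{array}{lllll}
\sum x_1^2 & \sum x_1 x_2 & \sum x_1 x_3 & \cdots & \sum x_1 x_K \\
\sum x_1 x_2 & \sum x_2^2 & \sum x_2 x_3 & \cdots & \sum x_2 x_K \\
\sum x_1 x_3 & \sum x_2 x_3 & \sum x_3^2 & \cdots & \sum x_3 x_K \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_1 x_K & \sum x_2 x_K & \sum x_3 x_K & \cdots & \sum x_K^2.
\end{array}$$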



This matrix may also be rewritten in terms of variances and covariances
as follows:

$$\begin{array}{lllll}
N s_1^2 & N r_{12} s_1 s_2 & N r_{13} s_1 s_3 & \cdots & N r_{1K} s_1 s_K \\
N r_{12} s_1 s_2 & N s_2^2 & N r_{23} s_2 s_3 & \cdots & N r_{2K} s_2 s_K \\
N r_{13} s_1 s_3 & N r_{23} s_2 s_3 & N s_3^2 & \cdots & N r_{3K} s_3 s_K \\
\vdots & \vdots & \vdots & & \vdots \\
N r_{1K} s_1 s_K & N r_{2K} s_2 s_K & N r_{3K} s_3 s_K & \cdots & N s_K^2.
\end{array}$$


Again the sum of all the terms in this matrix may be written in a 
double subscript notation by using the subscripts g and h to designate 
the tests, having the limits for both g and h be from 1 to K. We may 
thus write 

(3) $\sum (x_1 + x_2 + x_3 + \cdots + x_K)^2 = N \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h$ (where $r_{gg} = 1$).

However, it will be noticed that the terms along the principal diagonal 
of the matrix are variances, while the non-diagonal terms are covariances. 
It is sometimes better to use a notation that keeps the variances and 
covariances separate and to write 

(4) $\sum (x_1 + x_2 + x_3 + \cdots + x_K)^2 = N \sum_{g=1}^{K} s_g^2 + N \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h \quad (g \neq h).$

It is necessary to specify that “g does not equal h” since the cases
where g does equal h have already been included in the sum of the 
variances. 

By symmetry a term corresponding to equation 4 can be written for 
the second factor in the denominator of equation 1. For this case where 
the limits are I to L, let us substitute I for 1 and L for K in equation 4. 
Also let us change subscripts, using G instead of g and H instead of h. 
Making these substitutions, we have 

(5) $\sum (x_I + x_{II} + x_{III} + \cdots + x_L)^2 = N \sum_{G=I}^{L} s_G^2 + N \sum_{G=I}^{L} \sum_{H=I}^{L} r_{GH} s_G s_H \quad (G \neq H).$

Using the double subscript notation indicated in equations 2 to 5, 
we can write the formula for the intercorrelation of any two sums in 
terms of the standard deviations and intercorrelations of the unit tests.
Substituting equations 2, 4, and 5 in equation 1, dividing the numerator
and denominator by N, and using $R_{KL}$ for $r_{(x_1 + \cdots + x_K)(x_I + \cdots + x_L)}$ in
equation 1, we have

(6) $R_{KL} = \frac{\sum_{g=1}^{K} \sum_{G=I}^{L} r_{gG} s_g s_G}{\sqrt{\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h} \, \sqrt{\sum_{G=I}^{L} s_G^2 + \sum_{G=I}^{L} \sum_{H=I}^{L} r_{GH} s_G s_H}} \quad (g \neq h; \; G \neq H).$

This formula is found in Spearman (1913) and Kelley (1923c). We may
also write the foregoing in terms of the average variance and the average
covariance of the unit tests by substituting the number of terms times
the average for each sum, and denoting the average by a bar above the term.

(7) $R_{KL} = \frac{KL \, \overline{(r_{gG} s_g s_G)}}{\sqrt{K\overline{(s_g^2)} + K(K - 1)\overline{(r_{gh} s_g s_h)}} \, \sqrt{L\overline{(s_G^2)} + L(L - 1)\overline{(r_{GH} s_G s_H)}}}.$

It should be noted that this equation is general and precise. It involves
no assumptions whatever. It is based only on simple algebraic
transformations, and it must be verified, barring arithmetical errors.


Equations 6 and 7 give the correlation of any two sums in
terms of variances and covariances for the unit tests. These
two equations form the basis for the derivations in this and
the succeeding chapter.
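
Because equation 6 is an algebraic identity, it can be checked on any set of scores. The following Python sketch is an illustration under assumptions: the scores are made up, and the helper names are hypothetical. It computes the correlation of two sums once from the unit-test covariances (each covariance being $r_{gG} s_g s_G$) and once directly from the summed scores.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Covariance with divisor N; equals r_ab * s_a * s_b."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def corr_of_sums(first, second):
    """Equation 6: correlation of the sum of the tests in `first` with the
    sum of the tests in `second`, built only from unit-test variances and
    covariances."""
    numerator = sum(cov(g, G) for g in first for G in second)
    var_first = sum(cov(g, h) for g in first for h in first)
    var_second = sum(cov(G, H) for G in second for H in second)
    return numerator / math.sqrt(var_first * var_second)

def corr_direct(first, second):
    """Correlation computed directly from the two composite scores."""
    c1 = [sum(scores) for scores in zip(*first)]
    c2 = [sum(scores) for scores in zip(*second)]
    return cov(c1, c2) / math.sqrt(cov(c1, c1) * cov(c2, c2))

# Made-up scores of five persons on tests 1, 2 and on tests I, II, III.
first = [[3, 5, 2, 8, 6], [4, 4, 1, 9, 7]]
second = [[2, 6, 3, 7, 5], [5, 5, 2, 8, 8], [1, 4, 2, 6, 4]]

print(corr_of_sums(first, second), corr_direct(first, second))
```

The two values agree to within rounding error, as the text asserts they must.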


3. Effect of test length on reliability (Spearman-Brown formula) 

However, in the usual case of trying to estimate the reliability of an 
augmented test, the average of the variances of the parallel unit tests 
that might be constructed is not available. Likewise the average 
covariance among these unit tests is not available. The only figure 
available is the variance of the first unit test and the reliability of this 
test. If we are willing to assume that the variance of the first unit 
test is a fair approximation to the average variance of all unit tests, 
and that the reliability coefficient times this variance is a fair approximation
to the average covariance, we shall have some values to substitute
in this formula. It should be noted that, unless we do this, the formula
cannot be used at all. It should also be emphasized that, if the new unit
tests have an average variance that is different from the variance of
the first test, or an average covariance different from $r_{11} s_1^2$, the new
tests are not parallel forms of the original unit test. In other words,
if the new unit tests are parallel to the original one, the assumption is 
valid and the formula will hold. If the formula does not hold, the new 
tests are not parallel forms of the original one. 

Making the substitution indicated in the foregoing paragraph, we
shall set

$s_1^2 = \overline{(s_g^2)} = \overline{(s_G^2)}$

and

$r_{11} s_1^2 = \overline{(r_{gh} s_g s_h)} = \overline{(r_{GH} s_G s_H)} = \overline{(r_{gG} s_g s_G)}.$

We may write the generalized formula giving the effect of increasing
test length on reliability as follows:

(8) $R_{KL} = \frac{KL \, r_{11} s_1^2}{\sqrt{K s_1^2 + K(K - 1) r_{11} s_1^2} \, \sqrt{L s_1^2 + L(L - 1) r_{11} s_1^2}}.$


This general formula may be simplified in several respects. For reliability,
a test is assumed to be correlated with another form of the same
length so that K = L. This means that the two expressions under the
radicals in the denominator are identical so that the product of two
square roots may be indicated by simply writing one of the expressions
without the radical sign. These changes give

(9) $R_{KK} = \frac{K^2 r_{11} s_1^2}{K s_1^2 + K(K - 1) r_{11} s_1^2}.$

Dividing numerator and denominator by $K s_1^2$, we have the final
formula which is

(10) $R_{KK} = \frac{K r_{11}}{1 + (K - 1) r_{11}}$ (general Spearman-Brown formula),

where $r_{11}$ is the reliability of the unit test,
K is the number of items in the lengthened test divided by the
number of items in the unit test, and
$R_{KK}$ is the reliability of the lengthened test.

Making a test K times as long increases the reliability as 
indicated in equation 10. 
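
Equation 10 is a one-line computation. A minimal Python sketch follows; the function name is a hypothetical choice, and the sample value (a unit reliability of .80, tripled in length) is used only for illustration.

```python
def spearman_brown(r11, K):
    """Equation 10: reliability of a test K times as long as a unit test
    whose reliability is r11. K need not be a whole number."""
    return K * r11 / (1 + (K - 1) * r11)

tripled = spearman_brown(0.80, 3)
print(round(tripled, 3))  # 0.923
```

Setting K = 1 returns the unit reliability unchanged, and values of K below 1 give the reliability of a shortened test, provided the remaining items still behave as a parallel part.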


This is the generalized Spearman-Brown formula showing the relationship
between test length (K) and reliability. As mentioned previously,
derivations of this equation were published simultaneously by
Spearman (1910) and Brown (1910). In view of the controversy waged
around this equation, it should be emphasized again that no assumptions
were made in deriving it, except that the variance and covariance figures
obtained for the first unit test could be used in place of the average
variance and the average covariance among the unit parallel tests. 
These assumptions are part of the definition of “parallel tests.” 


4. Graphing the relationships shown by the Spearman-Brown 
formula 

By regarding R and r as the variables in equation 10 and K as a parameter,
we can show that this is the equation of a rectangular hyperbola.
In order to show this, let us first subtract each side of the equation
from $1 + 1/(K - 1)$, giving

(11) $1 + \frac{1}{K - 1} - R = 1 + \frac{1}{K - 1} - \frac{Kr}{1 + (K - 1)r}.$

We may rearrange terms in the left-hand member and simplify the
first two terms in the right-hand member of equation 11 to give

(12) $1 - R + \frac{1}{K - 1} = \frac{K}{K - 1} - \frac{Kr}{1 + (K - 1)r}.$

Putting the terms in the right member over a common denominator and
simplifying gives

(13) $1 - R + \frac{1}{K - 1} = \frac{K}{(K - 1)[1 + (K - 1)r]}.$

We may write the denominator of the right-hand term as
$(K - 1)^2 [r + 1/(K - 1)]$, and then multiply both sides by $r + 1/(K - 1)$,
obtaining the usual form for the rectangular hyperbola (xy = c).

(14) $\left(1 - R + \frac{1}{K - 1}\right)\left(r + \frac{1}{K - 1}\right) = \frac{K}{(K - 1)^2}.$


This equation has been graphed in Figure 1. This figure shows the 
relationship between R and r for various values of K . It can be seen 
that for r-values of zero and unity, the value of R is the same as r for 
all values of K. For other values of r, R increases as K increases. It 
can be seen that these hyperbolas have a horizontal asymptote equal to 
$1 + 1/(K - 1)$. That is to say, as r approaches infinity, R approaches
a horizontal line $1/(K - 1)$ units above one. Also as R approaches
minus infinity, r approaches $-1/(K - 1)$. Since r and R designate
reliability coefficients, values outside the range zero to one are
meaningless, and are not shown in the graph. 
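
The hyperbolic form can be checked numerically. In the sketch below (function names are hypothetical, and the grid of r and K values is arbitrary), the product of the two shifted variables, $1 + 1/(K - 1) - R$ and $r + 1/(K - 1)$, is compared with the constant $K/(K - 1)^2$.

```python
def spearman_brown(r, K):
    """Equation 10."""
    return K * r / (1 + (K - 1) * r)

def hyperbola_product(r, K):
    """Left-hand side of the rectangular-hyperbola form of equation 10."""
    shift = 1 / (K - 1)
    R = spearman_brown(r, K)
    return (1 + shift - R) * (r + shift)

# The product should equal K / (K - 1)^2 whatever the value of r.
largest_gap = max(
    abs(hyperbola_product(r, K) - K / (K - 1) ** 2)
    for K in (2, 3, 5, 10)
    for r in (0.1, 0.3, 0.5, 0.7, 0.9)
)
print(largest_gap)
```

The largest discrepancy is at the level of floating-point rounding, so for a fixed K every (r, R) pair lies on the same hyperbola.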

Figure 1. Diagram for equation 14. Shows the augmented reliability (R) as a
function of the original reliability (r) for various changes in test length (K).

Equation 10 can also be graphed by regarding R and K as variables
and r as a parameter. Such a graph will show how the reliability of
the test increases as test length increases. Dividing the numerator and
denominator of the right-hand side of equation 10 by r gives

(15) $R = \frac{K}{K + \frac{1 - r}{r}}.$

Subtracting each side from 1 and simplifying gives

(16) $1 - R = \frac{\frac{1 - r}{r}}{K + \frac{1 - r}{r}},$

which can be converted readily into the form of the rectangular
hyperbola

(17) $[1 - R]\left[K + \frac{1 - r}{r}\right] = \frac{1 - r}{r}.$

Figure 2. Diagram for equation 17. Shows the augmented reliability (R) as a
function of the increase in test length (K) for different initial reliabilities (r).

It can be seen that as K approaches infinity, R approaches unity for 
all values of r. Also as R approaches negative infinity, K approaches
the vertical asymptote $-(1 - r)/r$. This equation is graphed in Figure 2.
However, since negative test length and negative reliability coefficients 
have no meaning, this part of the graph is omitted. 

For the purposes of preparing a computing diagram for equation 10, 
both equations 14 and 17 have the disadvantage of being composed of 
curved lines. This necessitates the computation of a great many points 
for each line and the use of an arbitrary smoothed curve for the intermediate
points. If equation 10 can be changed into a straight-line form,
it is necessary to have only two points for each line, which means that a 
practical computing diagram can be readily constructed by anyone 
who has occasion to compute a large number of values using equation 10. 

If we take equation 10, take the reciprocal of both sides, divide the
right-hand side by Kr, and then subtract unity from each side, we have

(18) $\frac{1}{R} - 1 = \frac{1}{K}\left(\frac{1}{r} - 1\right).$

If we regard the left side as the ordinate, the expression in parentheses
as the abscissa, and then give K the values 2, 3, etc., we get the
diagram shown in Figure 3. It should be noted that the measured
distances on the ordinate and abscissa are proportional to $(1/R) - 1$
and $(1/r) - 1$, respectively, but the numbers recorded along the ordinate
and abscissa are R and r, respectively. This graph shows the reliability
obtained by increasing the length of the test, when the reliability of
the unit test is .5 or greater.

Figure 3. Diagram for equation 18. Gives a linear computing diagram for equation 10,
the generalized Spearman-Brown formula, $\frac{1}{R} - 1 = \frac{1}{K}\left(\frac{1}{r} - 1\right)$.
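
The straight-line property behind such a computing diagram can be illustrated with a short check. In this sketch the function names are hypothetical and the value K = 4 is arbitrary: each reliability is mapped to its axis position, one less than its reciprocal, and the ratio of ordinate to abscissa is shown to be 1/K for every r, so each line passes through the origin with slope 1/K.

```python
def spearman_brown(r, K):
    """Equation 10."""
    return K * r / (1 + (K - 1) * r)

def axis_position(reliability):
    """Distance plotted on either axis: the reciprocal of the reliability,
    less one."""
    return 1 / reliability - 1

K = 4
slopes = [
    axis_position(spearman_brown(r, K)) / axis_position(r)
    for r in (0.5, 0.6, 0.75, 0.9)
]
print(slopes)  # each value is 1/K up to rounding
```

Since the slope does not depend on r, two computed points are enough to draw the line for any one value of K.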

5. Length of test necessary to attain a desired reliability 
Equation 10 gives the reliability of the lengthened test as a function 
of the reliability of the unit test, and the length of the new test. However,
the same equation also shows how long a test needs to be made
in order to have a specified reliability. For example, on a trial run a
given test has a reliability of .80, and we wish to construct a test with
a reliability of .90 or slightly larger. If the increased reliability must
be obtained solely by lengthening the test, is it necessary to double,
treble, or quadruple the length of the test? We can answer this question
by putting .80 for r, .90 for R, and solving for the one remaining unknown,
K, in equation 10. If we follow this procedure, we find that the
test must be slightly more than doubled to get a reliability of .90, 
whereas trebling the length of the test would give a reliability between 
.92 and .93. 

Equation 10 can be changed to show the test length needed for any 
given reliability. An explicit solution for K can be obtained most 
readily from equation 18. If we divide through by the left-hand side 
and multiply both sides by K, we have

(19) $K = \frac{(1 - r_{11}) R_{KK}}{(1 - R_{KK}) r_{11}}.$

The length of test needed for any specified reliability can be read from
any of the graphs. Probably Figure 2 is best for this purpose since K
appears here as one of the variables.


In order to increase the reliability of a test from r to R, the
number of items should be multiplied by $\frac{(1 - r)R}{(1 - R)r}$.
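
As a worked check of this rule, the sketch below (the function name is a hypothetical choice) computes the multiplier for the case already discussed in this section, raising a reliability of .80 to .90.

```python
def length_factor(r, R):
    """Factor by which the number of items must be multiplied to raise
    the reliability from r to R, assuming the added items are parallel
    to the original ones."""
    return (1 - r) * R / ((1 - R) * r)

factor = length_factor(0.80, 0.90)
print(round(factor, 2))  # 2.25
```

The result, 2.25, agrees with the statement that the test must be slightly more than doubled.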


6. A function of test reliability that is invariant with respect to 
changes in test length 

When we compare reliabilities for tests of different lengths, it is 
important to state clearly the precise question to be answered. We may 
ask: “Is this 20-item test (just as it stands) more or less reliable than 
this 100-item test?” In order to answer this question, the reliability 
coefficients of each of the tests should be compared. If the 20-item test 
has a reliability of .81, and the 100-item test a reliability of .87, the 
longer test is the more reliable test. However, we may wish to ask: 
“If the 20-item test could be lengthened to 100 items, how would it 
then probably compare with the 100-item test?” It should be carefully 
noted that this question implies that the test constructor can get 80 
other items comparable with the 20 now in the test, and that the students 
can answer 100 items of this type with essentially the same sort of 
performance they are able to give for 20 items. Substituting K = 5
and r = .81 in the equation

$R = \frac{Kr}{1 + (K - 1)r},$

we have

$R = \frac{4.05}{4.24} = .96.$


That is to say, if the 20-item test can be lengthened to 100 items without 
undue fatigue for either the item writers or the test takers, it will have 
a reliability that is definitely greater than the .87 of the present competing
100-item test. However, it is still true that, as things stand, the
short test has a reliability of .81 and the longer one a higher reliability 
of .87. 
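
The comparison at a common length can be sketched in Python as follows. The function name and argument names are hypothetical; the figures are those of the example above.

```python
def projected_reliability(reliability, n_items, target_items):
    """Spearman-Brown estimate of the reliability a test would have if its
    length were changed from n_items to target_items, assuming the added
    items are comparable with the present ones."""
    K = target_items / n_items
    return K * reliability / (1 + (K - 1) * reliability)

short_test_at_100 = projected_reliability(0.81, 20, 100)
long_test_at_100 = projected_reliability(0.87, 100, 100)  # unchanged, K = 1

print(round(short_test_at_100, 2), round(long_test_at_100, 2))  # 0.96 0.87
```

The projection says nothing about the tests as they stand; it only answers the second of the two questions distinguished in the text.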

For some general comparison purposes, we may wish to compare 
test reliabilities, allowing for length of test, but may also feel that it 
is rather arbitrary to reduce all the tests to any specified length, such as 
50, 100, or 200 items. Since all the reliabilities approach unity as the 
test length is increased, we easily see that, if we choose the 200-item test 
as the standard, all the reliabilities will be much closer together than if 
we choose the 50-item test as the standard. Also the statistical sampling 
problems become difficult to work out for various extrapolations of 
different amounts. It is possible to devise a quantity that depends on 
test reliability and number of items and is invariant for changes in test 
length, as long as the test reliability increases, with length, according 
to the Spearman-Brown formula. 

Let us use the following notation: 


$R_{LL}$ designates the reliability of a test of length L,
$R_{KK}$ designates the reliability of a test of length K.


The problem then is to find some function, F, such that
$F(L, R_{LL}) = F(K, R_{KK})$. First let us express the reliabilities of each lengthened
test in terms of the reliability of the unit test (r).

(20) $R_{LL} = \frac{Lr}{1 + (L - 1)r}, \qquad R_{KK} = \frac{Kr}{1 + (K - 1)r}.$

Taking the reciprocal and deducting one from each of these equations,
we have

(21) $\frac{1}{R_{LL}} - 1 = \frac{1 + (L - 1)r}{Lr} - 1, \qquad \frac{1}{R_{KK}} - 1 = \frac{1 + (K - 1)r}{Kr} - 1.$

Simplifying the right-hand member of each equation and multiplying
through by the number of items, we have

(22) $L\left(\frac{1}{R_{LL}} - 1\right) = K\left(\frac{1}{R_{KK}} - 1\right) = \frac{1}{r} - 1.$

The number of items multiplied by one less than the reciprocal
of the reliability is invariant with respect to test length.

This, then, is the function mentioned at the beginning of this paragraph,
a function of test length and reliability that is invariant with
respect to test length, provided that test reliability increases with length 
according to the Spearman-Brown formula. It furnishes a method of 
comparing two tests without making any arbitrary decisions regarding 
the length of test at which the comparison is to be made. It may also 
be that the sampling theory for such a value can be worked out more 
readily than that for the reliabilities that have been increased by use 
of the Spearman-Brown formula. 
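
A brief Python sketch of this invariant follows. The function names are hypothetical, and the 20-item test with reliability .81 and the 100-item test with reliability .87 are the figures of the earlier example. The value for the 20-item test is computed as it stands and again after a Spearman-Brown projection to 100 items.

```python
def spearman_brown(r, K):
    """Equation 10."""
    return K * r / (1 + (K - 1) * r)

def length_invariant(n_items, reliability):
    """Number of items multiplied by one less than the reciprocal of the
    reliability (equation 22)."""
    return n_items * (1 / reliability - 1)

as_it_stands = length_invariant(20, 0.81)
after_lengthening = length_invariant(100, spearman_brown(0.81, 5))
competing_test = length_invariant(100, 0.87)

print(round(as_it_stands, 3), round(after_lengthening, 3), round(competing_test, 3))
```

The first two values coincide, so the quantity does not depend on the length at which the test is evaluated. A smaller value corresponds to a higher reliability at any common length, so on this basis the 20-item test compares favorably with the 100-item test.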


7. Summary 

The general formula for the correlation of any two sums may be 
expressed as follows: 

Let one set of tests be designated by the subscripts g or h (g = 1, ..., K;
h = 1, ..., K),

let the other set of tests be designated by the subscripts G or H
(G = I, ..., L; H = I, ..., L),

s is a standard deviation of one of the unit tests,
r is the correlation between two of the unit tests,

$R_{KL}$ is the correlation of the sum of K tests with the sum of L other
tests.


$R_{KL}$ can then be written as a function of the r's and s's,

(6) $R_{KL} = \frac{\sum_{g=1}^{K} \sum_{G=I}^{L} r_{gG} s_g s_G}{\sqrt{\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h} \, \sqrt{\sum_{G=I}^{L} s_G^2 + \sum_{G=I}^{L} \sum_{H=I}^{L} r_{GH} s_G s_H}} \quad (g \neq h; \; G \neq H).$

Writing the foregoing expression in terms of average variances and
covariances and denoting the average by a bar over the term, we have

(7) $R_{KL} = \frac{KL \, \overline{(r_{gG} s_g s_G)}}{\sqrt{K\overline{(s_g^2)} + K(K - 1)\overline{(r_{gh} s_g s_h)}} \, \sqrt{L\overline{(s_G^2)} + L(L - 1)\overline{(r_{GH} s_G s_H)}}}.$

It should be noted that equations 6 and 7 are general and precise. No
assumptions regarding parallel tests or any other limitations on the
nature of the tests were made in deriving them. They are based on
simple algebraic transformations and will be verified, barring
arithmetical errors in the work.

The relationship between test length and reliability has been shown. 
Solving explicitly for reliability as a function of test length, we have 

(10) $R_{KK} = \frac{K r_{11}}{1 + (K - 1) r_{11}}$ (the Spearman-Brown formula).

Solving explicitly for test length as a function of reliability, we have

(19) $K = \frac{(1 - r_{11}) R_{KK}}{(1 - R_{KK}) r_{11}}.$

It was shown that

(22) $K\left(\frac{1}{R_{KK}} - 1\right)$

is a function of test length and reliability that can be expected to show
no systematic changes in value as test length increases.

Problems 

1. From the Spearman-Brown formula, write a formula that will show the test 
length necessary for any specified reliability. 

2. There have been several articles dealing with the experimental verification of 
the Spearman-Brown formula. Study these articles; then summarize and comment 
on them. (See articles by Holzinger and Clayton, 1925; Ruch, Ackerson, and Jackson,
1926; Furfey, 1926c; Wood, 1926c; Remmers, Shock, and Kelly, 1927; Kelley,
1925; and Gordon, 1924.) 

3. State clearly the assumptions made in deriving the Spearman-Brown formula. 


4.

Test   Mean   Standard    Number     Reliability
              Deviation   of Items
A      54.8   14.7        100        .92
B      27.9   10.6         50        .94
C      10.5    4.1         20        .83
D      83.4   10.1         60        .89
E      12.3    3.4         30        .53

(a) What will be the reliability of test A if it is lengthened to 300 items?

(b) Estimate the reliability of test B if 25 items are added, making a 75-item test.

(c) How long would test C need to be to have a reliability of .95?

(d) Suppose that for test A, we were satisfied with a reliability of .85. How many
items would be required for this lower reliability?

(e) How many items would be required to give test D an index of reliability of .90?

(f) Estimate the index of reliability of test E for triple length.

(g) How many items would be required to give test E a reliability coefficient of .90?

5. Read and comment on the material in Guilford, Psychometric Methods (1936b),
page 419, on the Spearman-Brown formula.

6. What is the reliability for a test of infinite length?


Effect of Test Length
on Validity (General Case)


1. Meaning of validity 

Reliability has been regarded as the correlation of a given test with 
a parallel form. Correspondingly, the validity of a test is the correlation
of the test with some criterion. In this sense a test has a great
many different “validities.” For example, the ACE Psychological 
Examination has one validity for predicting grades in English and a 
different validity for predicting grades in Latin. It is also found in 
studying various validity coefficients for a given test that they vary 
from school to school, and from time to time. In other words, validity 
cannot be regarded as a fixed or a unitary characteristic of a test. As 
new uses for a test are contemplated, new validity coefficients must be 
determined; and, when use of a test is continued, the validity coefficients
must be redetermined at intervals. In the remainder of this
chapter we shall refer to “test validity” only in the sense that we are 
considering the relationship between test length and its validity for 
predicting a specified criterion. In most practical investigations of a 
test, we should be comparing several different validity coefficients. 

2. Effect of test length on validity

The general formula for the correlation of any two sums may also be 
utilized to determine the effect of test length upon test validity. We 
shall first consider the case in which the criterion variable is not altered. 
In this case L equals I, since we do not consider the effect of lengthening 
the criterion. Let $R_{(1 \cdots K)I}$ be used to designate the correlation between
a test of length K and the original criterion variable. In this case the
general formula (equation 6 of Chapter 8) for the correlation of any two
sums becomes

(1) $R_{(1 \cdots K)I} = \frac{\sum_{g=1}^{K} r_{gI} s_g s_I}{\sqrt{\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h} \, \sqrt{s_I^2}} \quad (g \neq h).$

The $s_I$ in numerator and denominator cancel, and the expression remaining
may be written in terms of averages as follows:

(2) $R_{(1 \cdots K)I} = \frac{K\overline{(r_{gI} s_g)}}{\sqrt{K\overline{(s_g^2)} + K(K - 1)\overline{(r_{gh} s_g s_h)}}}.$


Again it should be noted that this expression is precise. It involves 
no assumptions whatever. The numerator is an average involving a 
number of validity coefficients. The denominator involves the reliabilities
of the unit tests. However, when we have the data for this formula,
we can actually compute the validity of the lengthened test. It is
necessary to estimate the validity of the lengthened test only when the
data necessary for equation 2 are not available. It is reasonable in
such a case to assume that the values found for the first test are a
reasonable approximation to the average values that would be found for
all the unit tests. As indicated before, this assumption must be true
if we succeed in making the new unit tests parallel to the original unit
test. Setting these assumptions down explicitly, we have

(3) $r_{1I} s_1 = \overline{(r_{gI} s_g)}, \qquad s_1^2 = \overline{(s_g^2)}, \qquad \text{and} \qquad r_{11} s_1^2 = \overline{(r_{gh} s_g s_h)}.$


Substituting equations 3 in equation 2, we have

(4) $R_{(1 \cdots K)I} = \frac{K r_{1I} s_1}{\sqrt{K s_1^2 + K(K - 1) r_{11} s_1^2}}.$

Dividing both numerator and denominator by $s_1 \sqrt{K}$ and using $R_{KI}$ to
indicate the correlation of the new test (which is K times its original
length) with the original unit criterion, we have

(5) $R_{KI} = \frac{r_{1I} \sqrt{K}}{\sqrt{1 + (K - 1) r_{11}}},$

where $R_{KI}$ is the augmented validity coefficient,
$r_{1I}$ is the validity coefficient of the unit test,
$r_{11}$ is the reliability coefficient of the unit test, and
K is the number of times the test is increased in length.


Multiplying the length of a test by K increases the validity
coefficient as shown in equation 5.

By comparing the preceding equation (equation 5) with equation 10 of
Chapter 8, we see that the multiplier for the validity coefficient is the
square root of that for the reliability coefficient. That is, augmenting
a test so that the reliability will be multiplied by 1.44 will only multiply
the validity by 1.2. Since the validity coefficient is usually considerably
smaller than the test reliability, this usually means that changing
the length of a test can be expected to have only a very slight effect on 
the validity of the test. 
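
The relation between the two multipliers can be illustrated numerically. In the sketch below the function names are hypothetical, and the unit validity of .40 and unit reliability of .50 are assumed values chosen only for the example.

```python
import math

def spearman_brown(r11, K):
    """Equation 10 of Chapter 8: reliability of the lengthened test."""
    return K * r11 / (1 + (K - 1) * r11)

def augmented_validity(r1I, r11, K):
    """Equation 5: validity of a test lengthened K times, where r1I is the
    validity and r11 the reliability of the unit test."""
    return r1I * math.sqrt(K) / math.sqrt(1 + (K - 1) * r11)

# Assumed unit test: validity .40, reliability .50, doubled in length.
r1I, r11, K = 0.40, 0.50, 2

validity_multiplier = augmented_validity(r1I, r11, K) / r1I
reliability_multiplier = spearman_brown(r11, K) / r11

print(round(validity_multiplier, 4), round(math.sqrt(reliability_multiplier), 4))
```

Both printed values are the same, which is the square-root relation stated in the text.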

In order to see readily the effect on validity to be expected from 
increasing test length, equation 5 can be simplified by dividing both 
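sides of the equation by $r_{1I}$ (the validity coefficient of the unit test) and
then dividing both numerator and denominator of the right-hand side
by $\sqrt{K}$. This procedure gives

(6) $\frac{R_{KI}}{r_{1I}} = \frac{1}{\sqrt{1/K + (1 - 1/K) r_{11}}}.$

Squaring both sides and taking reciprocals, we have

(7) $\frac{r_{1I}^2}{R_{KI}^2} = \frac{1}{K} + \left(1 - \frac{1}{K}\right) r_{11}.$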

That is, the ratio of the squared validity coefficients is equal to a linear
function of the reliability coefficient. It can be easily verified from
equation 10 of Chapter 8 that the reliability of the unit test divided by
the reliability of the augmented test equals this same linear function of
the reliability coefficient. Equation 7 is graphed in Figure 1. The ratio
of the squared validity coefficients is plotted as the ordinate, and the
reliability as the abscissa. The appropriate straight line is shown for
several selected values of K. The graph is read by entering at the bottom
with the known reliability of the unit test and then moving up to
the selected value of K, and then horizontally out to the right-hand
margin. For example, as shown by the dotted lines in the figure, if a
test with a reliability of .5 is doubled in length, the ratio of the squared
validity coefficients is .75. That is, the squared validity for the doubled
test will be one-third larger than the squared validity of the unit test. The
validity coefficient will be multiplied by $\sqrt{1.3333}$, or about 1.15; doubling a
test with a reliability of .5 will increase the validity coefficient by about 15
per cent.
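The reading taken from Figure 1 can be checked numerically. The sketch below assumes only the linear function of equation 7; the function name `squared_validity_ratio` is introduced here for illustration.

```python
import math


def squared_validity_ratio(r_11: float, K: float) -> float:
    """Equation 7: r_1I^2 / R_KI^2 = 1/K + (1 - 1/K) * r_11."""
    return 1.0 / K + (1.0 - 1.0 / K) * r_11


# Example from the text: reliability .5, test doubled in length.
ratio = squared_validity_ratio(r_11=0.5, K=2)
gain = 1.0 / math.sqrt(ratio)  # R_KI / r_1I, the multiplier for the validity
print(ratio, round(gain, 4))
```

The square root of the reciprocal is the quantity marked on the second scale of the diagram, so the same two lines reproduce both scales.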

By simply changing the scale markings to give the square root of the
reciprocal, we can read directly the ratio of the augmented to the original
validity coefficient. These values are indicated on the scale directly
under the heading $R_{KI}/r_{1I}$. We see immediately from the graph, for
example, that, if the test reliability is greater than .5, making the test
infinitely long increases the validity by less than 41 per cent.

Since the same graph gives directly the ratio of the original to the
augmented reliability coefficient, as can be seen from equation 10 of
Chapter 8, an additional scale has been added at the extreme right
giving the ratio of the augmented to the original reliability coefficient.
This scale is given under the heading $R_{KK}/r_{11}$. By comparing the scales
for the reliability and validity coefficients, we see that an increase in
test length that doubles the reliability coefficient increases the validity
coefficient by only 41 per cent.

[Figure 1. Computing diagram for equation 7. The relative increase in validity
or reliability as a function of the original reliability and the amount of increase in
test length.]

It is also possible to add another portion to the graph of Figure 1 in
order to include the original and augmented reliability coefficients instead
of merely their ratios. Such a diagram is Figure 2. The easiest
way to read this graph is first to find the validity coefficient of the unit
test ($r_{1I}$) in the scale at the extreme right and then to place a ruler on
the horizontal line for $r_{1I}$. Next enter the bottom left-hand scale with
the value of the reliability coefficient for the unit test, go up to the
radiating line appropriate for K, right to the heavy vertical center line,
down to the ruler (previously placed to indicate $r_{1I}$), and then up to the
value of the augmented validity coefficient. In the illustration shown
(Figure 2), the validity of the unit test is .6, and the reliability of the
unit test is .7. If this test is trebled in length, its new validity will be
between .65 and .70 (about .67).

[Figure 2. Computing diagram for equation 7. The augmented validity as a
function of the original validity, reliability, and increase in test length.]


3. Length necessary for a given validity 

Sometimes in planning new tests it is desirable to know how much a
test must be lengthened in order to achieve a specified validity coefficient.
It might be noted parenthetically here that, before inquiring
how much the test must be lengthened to achieve a given validity, we
might investigate the effect of making the test infinite in length. We
do this simply by using the lowest of the radiating lines (marked $K = \infty$)
in the left half of Figure 2. This topic is also discussed in a later section
of this chapter, Section 5 (validity for a test of infinite length). If
making a test infinitely long will not achieve the desired validity, we
know that simply increasing test length will never achieve the desired
validity. However, if the test of infinite length has a validity higher
than desired, we see what would happen if the test were only 20, 10, 5,
or 3 times as long. Again Figure 2 can be used, entering it with the
known validity and reliability coefficients for the unit test and the desired
augmented validity. Then the K-value necessary for such an
augmented validity can be determined.

In order to check the approximate result obtained with the graph,
and also for more precise calculation, it is desirable to have an equation
that gives K as a function of $R_{KI}$, $r_{1I}$, and $r_{11}$. Such a formula can
readily be obtained from equation 5. Squaring and multiplying through
by the denominator, we have

(8)    $R_{KI}^2 + K r_{11} R_{KI}^2 - r_{11} R_{KI}^2 = K r_{1I}^2$.


Solving equation 8 for K gives 


(9)    $K = \dfrac{R_{KI}^2 (1 - r_{11})}{r_{1I}^2 - r_{11} R_{KI}^2}$,


where the terms have the same meaning as in equation 5.


Equation 9 gives the test length (K) necessary for a specified
validity ($R_{KI}$).


The graphical representation of equation 7 given in Figures 1 and 2
also holds for equation 9. Since equation 9 is written explicitly for K,
it is more convenient to use if we wish to know the length necessary for
a desired augmented validity coefficient. A zero value for the denominator
of equation 9 indicates that the test must be made infinite in
length in order to achieve the desired $R_{KI}$. A negative value for the
denominator indicates that the desired validity cannot be achieved by
lengthening the test.
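The three cases just described (a positive, a zero, and a negative denominator) can be handled together in a short routine. This is a sketch only: the function name `length_for_validity`, the use of `math.inf` and `None` as return markers, and the tolerance used to detect a zero denominator are all choices made here for illustration.

```python
import math


def length_for_validity(r_1I: float, r_11: float, target: float):
    """Equation 9: the length factor K needed for validity `target`.

    Returns math.inf when the denominator is zero (only an infinitely
    long test reaches the target) and None when it is negative (the
    target cannot be reached by lengthening the test).
    """
    denom = r_1I ** 2 - r_11 * target ** 2
    if math.isclose(denom, 0.0, abs_tol=1e-12):
        return math.inf
    if denom < 0:
        return None
    return target ** 2 * (1.0 - r_11) / denom


# Unit test of Figure 2: validity .6, reliability .7.
print(length_for_validity(0.6, 0.7, 0.67))  # a little under 3
print(length_for_validity(0.6, 0.7, 0.75))  # None: beyond the ceiling of equation 15
```

A target below the unit validity gives a value of K less than 1, that is, a shortened test.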


4. A function of validity that is invariant with respect to
changes in test length

In the preceding chapter, which treated the effect of test length on
reliability, we found one function that did not change with test length;
see equation 22 of Chapter 8. Similarly, if we wish to compare different
length tests with respect to validity for predicting a given criterion, it
is desirable to have some function involving validity that does not vary
with test length. Dividing equation 10 of Chapter 8 by $r_{11}$ and taking
the square root, we have


(10)    $\sqrt{\dfrac{R_{KK}}{r_{11}}} = \sqrt{\dfrac{K}{1 + (K - 1)\,r_{11}}}$.


Substituting the left side of equation 10 for the radical in equation 5,
we have


(11)    $R_{KI} = r_{1I}\sqrt{\dfrac{R_{KK}}{r_{11}}}$.


Similarly, by analogy, if the test had been lengthened L times, we should
have


(12)    $R_{LI} = r_{1I}\sqrt{\dfrac{R_{LL}}{r_{11}}}$,


from which it follows that


(13)    $\dfrac{R_{KI}}{\sqrt{R_{KK}}} = \dfrac{R_{LI}}{\sqrt{R_{LL}}}$.

The ratio of the validity coefficient to the index of reliability
does not change with increase in test length.
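The invariance stated in equation 13 is easily checked numerically. The sketch below assumes that equation 10 of Chapter 8 is the familiar Spearman-Brown formula, as equation 10 of this chapter implies; the function names and the illustrative unit test (validity .50, reliability .70) are introduced here and are not taken from the text.

```python
import math


def spearman_brown(r_11: float, K: float) -> float:
    """Reliability of a test K times unit length (equation 10 of Chapter 8)."""
    return K * r_11 / (1.0 + (K - 1.0) * r_11)


def augmented_validity(r_1I: float, r_11: float, K: float) -> float:
    """Equation 5: validity of a test K times unit length."""
    return r_1I * math.sqrt(K) / math.sqrt(1.0 + (K - 1.0) * r_11)


def validity_over_index_of_reliability(r_1I: float, r_11: float, K: float) -> float:
    """Left side of equation 13 for a test of length K."""
    R_KI = augmented_validity(r_1I, r_11, K)
    R_KK = spearman_brown(r_11, K)
    return R_KI / math.sqrt(R_KK)


# Hypothetical unit test: validity .50, reliability .70.
for K in (0.5, 1, 2, 5, 20):
    print(K, round(validity_over_index_of_reliability(0.50, 0.70, K), 6))
```

Every value of K, including the fractional one, gives the same ratio, namely the unit validity divided by the square root of the unit reliability.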


It should be carefully noted that the foregoing relationship between
validity and reliability holds only when validity and reliability are
altered by changing the length of the test, without varying the nature of
the items in the test. There must be no change in the variability of the
group taking the test. For a discussion of the changes in reliability and
validity with changes in the group heterogeneity, see Chapters 10, 11,
and 12. Chapter 11 includes a discussion of the relationship between
validity and reliability as the standard deviation of the group changes.


5. Validity of a perfect test for predicting the original criterion

We can extend equation 5 to estimate what would happen to the 
validity coefficient of a test if it were made infinitely long while the 
criterion measure remained the same. We shall need to assume, of 
course, that in lengthening the test each of the new unit tests is parallel 
to the original one. That is to say, they each have the same mean and 
standard deviation as the original test, and the same reliability and 
validity. 

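If we let K become infinite in equation 5, it gives the indeterminate
form $\infty/\infty$. However, if we first divide the numerator and denominator
by $\sqrt{K}$, we have

(14)    $R_{KI} = \dfrac{r_{1I}}{\sqrt{\dfrac{1}{K} + \left(1 - \dfrac{1}{K}\right) r_{11}}}$.

If we let K approach infinity, equation 14 simplifies to

(15)    $R_{\infty I} = \dfrac{r_{1I}}{\sqrt{r_{11}}}$.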

If a test is made infinitely long and hence perfectly reliable,
its validity for predicting the original criterion measure will
be the original validity coefficient divided by the index of
reliability for the original test.

That is to say, if it is possible to increase a test in length without
limit, and to do so by adding only parallel forms of the original test, the
augmented validity can never be higher than indicated in equation 15.
This equation is a convenient one to use where we desire to know if it is
worth while to attempt to improve the validity of a test by simply increasing
its length. This equation is much easier to apply than equation 5,
and, if the test has a fair reliability, equation 15 will show that
increasing the length of the test (even to infinite length) will not appreciably
affect its validity. If the reliability of the test is reasonably low,
we find that a reasonable increase in validity can be made by lengthening
the test. Then it is relevant to ask how much longer the test
should be made. Will doubling or trebling the length of the test probably
give a sufficient increase in validity to be worth the effort? Equation 5
can be used with K taking various values, such as 2, 3, 4, to see if
a practicable increase in test length would change the validity sufficiently
to be worth while. It may be remarked here parenthetically
that the usual conclusion from equation 5 is that the increased validity
obtained from increasing the length of a test is negligible.

Equation 15, showing the effect on the validity of making a test
infinitely long, has been graphed in Figure 3. By looking up the validity
at the bottom of the graph, and then moving up to the diagonal line
for the appropriate reliability, we can read from the column at the right
the expected validity of an infinitely long test. We see from this graph
that, if the validity of a test is greater than the square root of its
reliability, the expected validity for infinite length is greater than unity.
This is an unreasonable situation. If any actual figures show a validity
greater than the square root of the reliability, the result may be regarded
as a fluke of some sort that cannot be relied on to repeat itself.
Such a result should lead us to check computations very carefully to
see if any error has been made, even though such results can arise without
errors in arithmetic. Indeed, in general the reliability of a test
should not merely be greater than the square of the validity coefficient
but should also be greater than the validity coefficient. In most practical
situations the validities of a test run much lower than its reliability.
Tests with reliabilities in the nineties have validity coefficients that
might range from .70 to .30 or lower. The dotted lines in Figure 3 show
that if the reliability and validity of the unit test are .7 and .5,
respectively, lengthening the test cannot increase the validity above .6.

[Figure 3. Computing diagram for equation 15 or equation 17. Correlation between
two measures when one is increased to infinite length, while the other remains
unaltered. (15) $R_{\infty I} = r_{1I}/\sqrt{r_{11}}$ or (17) $R_{1\infty} = r_{1I}/\sqrt{r_{II}}$.]
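The ceiling given by equation 15, and the consistency check on reported coefficients described above, can be sketched as follows. The function names are assumptions made for this illustration; the numerical case is the one shown by the dotted lines in Figure 3.

```python
import math


def max_validity_by_lengthening(r_1I: float, r_11: float) -> float:
    """Equation 15: validity of an infinitely long test for the
    original (unaltered) criterion."""
    if not 0.0 < r_11 <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    return r_1I / math.sqrt(r_11)


def coefficients_are_consistent(r_1I: float, r_11: float) -> bool:
    """True when the validity does not exceed the square root of the
    reliability, so that equation 15 cannot exceed unity."""
    return r_1I <= math.sqrt(r_11)


# Case shown in Figure 3: reliability .7, validity .5.
print(round(max_validity_by_lengthening(0.5, 0.7), 3))
print(coefficients_are_consistent(0.5, 0.7))
```

A reported pair of coefficients that fails the second check deserves the careful recomputation recommended in the text.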

As remarked previously, application of the equations showing the 
probable effect of lengthening a test upon its reliability and validity will 
indicate that not much improvement is to be expected from increasing 
the length of the test somewhat. However, since altering length does 
not have much effect on reliability and validity, this frequently means 
that if we have a very good test, it is possible to shorten it considerably 
without seriously damaging its validity or reliability. Equation 5 may 
also be used with fractional values for K to determine the effect of
cutting a test to one-third or one-half its present length. It can be seen
that shortening a reliable test by one-third its present length will have 
little effect on reliability and validity and may well be considered if the 
reliability is already over .95. 
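The effect of shortening can be examined with a fractional K. In the sketch below, shortening a test by one-third is taken to mean K = 2/3; the reliability of .95 follows the text, whereas the validity of .60 and the function names are hypothetical values introduced only for illustration. The reliability formula is assumed to be the Spearman-Brown formula (equation 10 of Chapter 8).

```python
import math


def spearman_brown(r_11: float, K: float) -> float:
    """Reliability of a test K times its present length."""
    return K * r_11 / (1.0 + (K - 1.0) * r_11)


def augmented_validity(r_1I: float, r_11: float, K: float) -> float:
    """Equation 5, here applied with K < 1 to represent shortening."""
    return r_1I * math.sqrt(K) / math.sqrt(1.0 + (K - 1.0) * r_11)


K = 2.0 / 3.0  # test cut by one-third of its present length
print(round(spearman_brown(0.95, K), 3))            # new reliability
print(round(augmented_validity(0.60, 0.95, K), 3))  # new validity (hypothetical .60 at full length)
```

Under these assumed values both coefficients fall only in the second decimal place, which is the sense in which a highly reliable test can be shortened without serious loss.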

6. Validity of the original unit length test for predicting a 
perfect criterion 

Equation 11 may be regarded as showing what happens to the correlation
between two measures when one of them is increased to infinite
length and the other remains unaltered. It was indicated that the test
length was increased, while the criterion measure remained unaltered.
The same formal relationships would hold if the test remained the same,
while the criterion measure were increased in length. Thus by symmetry
we can use $R_{1K}$ to represent the augmented validity coefficient and
$r_{II}$ to represent the reliability of the original criterion measure, and
write


(16)    $R_{1K} = \dfrac{r_{1I}\sqrt{K}}{\sqrt{1 + (K - 1)\,r_{II}}}$.

Equation 16 shows the increase in test validity as the criterion
measure is increased in length.

If K approaches infinity in equation 16, we have in the limit

(17)    $R_{1\infty} = \dfrac{r_{1I}}{\sqrt{r_{II}}}$.

If the criterion measure is made perfectly reliable by being
made infinitely long, the validity of the original test for predicting
this criterion measure will be the original validity coefficient
divided by the index of reliability for the original
criterion measure.


7. Effect of altering length of both test and criterion 

If we wish to consider the effect of lengthening both the test measure
and the criterion measure, the general formula for the correlation of
any two sums applies with very little alteration. We may begin with
equation 7 of Chapter 8,


$R_{KL} = \dfrac{KL\,(r_{gG} s_g s_G)}{\sqrt{K(s_g^2) + K(K - 1)(r_{gh} s_g s_h)}\;\sqrt{L(s_G^2) + L(L - 1)(r_{GH} s_G s_H)}}$,

and make the following assumptions:

1. Since the various forms of the test are parallel, $s_g = s_h$.

2. Since the various criterion units are parallel, $s_G = s_H$.

3. The average validity coefficient $r_{gG} = r_{1I}$ (the validity of the
original unit test).

4. The average test reliability $r_{gh}$ equals the reliability of the original
unit test ($r_{11}$).

5. The average criterion reliability $r_{GH} = r_{II}$ (the reliability of the
original unit criterion measure).

Making these substitutions and simplifying gives


(18)    $R_{KL} = \dfrac{KL\,r_{1I}}{\sqrt{K + K(K - 1)\,r_{11}}\;\sqrt{L + L(L - 1)\,r_{II}}}$.

Dividing the numerator and denominator through by KL gives


(19)    $R_{KL} = \dfrac{r_{1I}}{\sqrt{\dfrac{1}{K} + \left(1 - \dfrac{1}{K}\right) r_{11}}\;\sqrt{\dfrac{1}{L} + \left(1 - \dfrac{1}{L}\right) r_{II}}}$,


where $R_{KL}$ = the validity of the test augmented K times for predicting
the criterion augmented L times,
$r_{1I}$ = the validity of the unit test for predicting the unit criterion
measure,
$r_{11}$ = the reliability of the unit test, and
$r_{II}$ = the reliability of the unit criterion measure.

This formula and many variants of it were given by Spearman (1910). 

Equation 19 is the general equation showing the correlation 
of a test K times as long as the original test with a criterion 
measure L times as long as the original one. Equations 10, 
Chapter 8, and 5, 15, 17, and 21 of Chapter 9 are special 
cases of equation 19. 


If we begin with equation 19 and set L = K (the criterion being taken as
a parallel form of the test, so that $r_{1I} = r_{11} = r_{II}$), we obtain equation 10
of Chapter 8; set L = 1, we obtain equation 5 of Chapter 9; set L = 1
and $K = \infty$, we obtain equation 15 of Chapter 9; set K = 1 and $L = \infty$,
we obtain equation 17 of Chapter 9; and finally, by setting $L = K = \infty$,
we obtain equation 21 of Chapter 9.
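Because the other equations are special cases, a single routine for equation 19 serves for all of them. The following is a sketch under that reading; the function name `lengthened_correlation`, the default arguments, and the use of `math.inf` to stand for infinite length are conventions adopted here, not the author's.

```python
import math


def lengthened_correlation(r_1I: float, r_11: float, r_II: float,
                           K: float = 1.0, L: float = 1.0) -> float:
    """Equation 19: correlation of a test K times unit length with a
    criterion L times unit length. K or L may be math.inf."""

    def factor(reliability: float, n: float) -> float:
        # sqrt(1/n + (1 - 1/n) * reliability); 1/n is taken as 0 for infinite n.
        inv = 0.0 if math.isinf(n) else 1.0 / n
        return math.sqrt(inv + (1.0 - inv) * reliability)

    return r_1I / (factor(r_11, K) * factor(r_II, L))


r_1I, r_11, r_II = 0.5, 0.7, 0.6  # hypothetical unit coefficients
print(lengthened_correlation(r_1I, r_11, r_II, K=3, L=2))                # equation 19
print(lengthened_correlation(r_1I, r_11, r_II, K=3))                     # equation 5 (L = 1)
print(lengthened_correlation(r_1I, r_11, r_II, K=math.inf))              # equation 15
print(lengthened_correlation(r_1I, r_11, r_II, L=math.inf))              # equation 17
print(lengthened_correlation(r_1I, r_11, r_II, K=math.inf, L=math.inf))  # equation 21
```

Leaving L at its default of 1 makes the criterion factor equal to unity, which is why the criterion reliability drops out of equation 5.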

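If in equation 19 we divide by $r_{1I}$, square both sides, and take the
reciprocal, we have


(20)    $\dfrac{r_{1I}^2}{R_{KL}^2} = \left[\dfrac{1}{K} + \left(1 - \dfrac{1}{K}\right) r_{11}\right]\left[\dfrac{1}{L} + \left(1 - \dfrac{1}{L}\right) r_{II}\right]$.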


We see that equation 20 is essentially the same as equation 7, except
that the right side of equation 7 is one linear function, whereas the right
side of equation 20 is the product of two linear functions. This means
that the graph of Figure 1 can be complicated somewhat to serve for
equation 20. In Figure 4 the lower left-hand section gives the value
for the left-hand bracket of equation 20; the upper right-hand section
gives the value for the right-hand bracket of equation 20; and the
lower right-hand section gives the product of these two. For example,
if the criterion reliability is .6 and L is 2.0, we enter the upper right
section of the graph with these two values, as shown by the dotted lines,
and mark with a ruler the appropriate radiating line in the lower right
section of the graph. For the foregoing values of .6 and 2.0, this line is
the dashed line PO. Leaving the ruler here for a marker, we enter the
lower left graph with the values of K and the test reliability. The
dotted lines illustrate this procedure for the case in which the test
reliability is .7, and K is 3.0. Thus we see that, if the criterion reliability
is .6 and the test reliability is .7, then if the criterion measure were
doubled and the test were tripled, the new validity coefficient would be
slightly more than 20 per cent above the old one.

Again, as was emphasized in Chapter 6, we must note that, in order
to increase the “test length” effectively both the number of items and
the test time limits must usually be altered. Also there must be no
serious fatigue or practice effects with the longer working period. In
each case the objective criterion that must be satisfied before the equations
of Chapters 6, 7, 8, and 9 apply is that each of the unit tests must
have the same mean, standard deviation, and error of measurement
and that the intercorrelations of the unit tests must be the same. If
these criteria are met, the tests satisfy the quantitative criteria for
parallel tests, and the equations for the effect of test length apply. A
statistical criterion for parallel tests is given in Chapter 14.

[Figure 4. Computing diagram for equation 19 or equation 20. The relative
increase in validity due to lengthening both measures by any specified amount.
Horizontal scale: $R_{KL}/r_{1I}$.]
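The graphical example given for Figure 4 can be checked against equation 20 directly. The following worked arithmetic is a sketch using only the values quoted in the text (criterion reliability .6 with L = 2, test reliability .7 with K = 3); nothing beyond equation 20 is assumed.

```latex
\begin{align*}
\frac{r_{1I}^2}{R_{KL}^2}
  &= \left[\frac{1}{3} + \left(1 - \frac{1}{3}\right)(.7)\right]
     \left[\frac{1}{2} + \left(1 - \frac{1}{2}\right)(.6)\right] \\
  &= (.80)(.80) = .64, \\
\frac{R_{KL}}{r_{1I}} &= \frac{1}{\sqrt{.64}} = 1.25 .
\end{align*}
```

On this arithmetic the new validity is 25 per cent above the old one, which is in line with the graphical reading of somewhat more than 20 per cent.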

The problem of how to adjust the relative lengths of several tests to 
maximize the validity of the composite has been solved; see Horst 
(1948) and (1949), and Taylor (1950). 


8. Validity of a perfect test for predicting a perfect criterion 
(the correction for attenuation) 

If both K and L are allowed to approach infinity, several of the terms 
will vanish, giving the formula for the correlation of a test of infinite 
length (or unit reliability) with a criterion of infinite length (and hence 
of unit reliability). This equation is 


(21)    $R_{\infty\infty} = \dfrac{r_{1I}}{\sqrt{r_{11}\,r_{II}}}$.


This is the well-known correction for attenuation. It is not properly 
called a “correction.” Rather it is an estimate of the correlation
between a perfect test and a perfect criterion.

This formula was given by Spearman (1904a), (1907), (1910), and 
(1913). 

A correlation coefficient “corrected for attenuation” ($R_{\infty\infty}$)
may be regarded as (a) the correlation between true scores in
each of the two measures and (b) the correlation between the
two measures when each is increased to infinite length (and
hence given a reliability of 1.00). This correlation is equal
to the correlation between the original measures divided by the
geometric mean of the two reliability coefficients.


Initial interest in the correction for attenuation arose from the belief
that it gave the “true” correlation between two variables, unattenuated
by the use of fallible (unreliable) measuring instruments. It
was thought that one of the sources of variation in observed coefficients
of correlation was variation in reliability of tests used. Therefore, if
coefficients were corrected for attenuation, there would be greater
agreement between different experiments. With the development of
factor theory, it became clear that, although variation in reliability
was one source of disagreement between the results of different experimenters,
there were many other important sources. In particular, each
test has a fairly high specific factor which is not duplicated in other
similar tests, and therefore would be a source of variation in the results
of different investigators. Also the factor composition of many, if not
most, tests is complex, and the variation in factor structure from one
test to another is another possible source of variation in results of different
investigations. Although the notion that the correction for
attenuation would give invariant results despite fallible tests does not
seem to have been borne out, the equation is still valuable in giving a
quick indication of whether it is worth while to attempt to increase validity
by increasing test length.

It will be noted that equations 21 and 15 are analogous. In equation 15
the validity is divided by the square root of one reliability coefficient.
If the result is divided by the square root of a second reliability
coefficient, we have equation 21. Thus two graphs like that of
Figure 3 will give a computing diagram for equation 21. These two
graphs are shown in Figure 5. Enter the graph with the correlation of
the two tests on the scale at the lower left, move up to the diagonal
representing one of the reliability coefficients, then to the right to the
diagonal representing the other reliability coefficient, and then up to
read the result from the scale at the upper right. The dotted lines
represent this procedure for the case in which the correlation between
the two tests is .5, while the reliability coefficients are .8 and .9. In
this case the correlation between true scores is about .59.
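The two-step reading of Figure 5 corresponds to two successive divisions, as the following sketch shows. The function name `corrected_for_attenuation` and the range checks are assumptions made for illustration; the numerical case is the one marked by the dotted lines.

```python
import math


def corrected_for_attenuation(r_1I: float, r_11: float, r_II: float) -> float:
    """Equation 21: estimated correlation between true scores on the
    two measures, given the observed correlation and both reliabilities."""
    if not (0.0 < r_11 <= 1.0 and 0.0 < r_II <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    step_one = r_1I / math.sqrt(r_11)   # equation 15: first diagonal of Figure 5
    return step_one / math.sqrt(r_II)   # second diagonal gives equation 21


# Case shown in Figure 5: correlation .5, reliabilities .8 and .9.
print(round(corrected_for_attenuation(0.5, 0.8, 0.9), 2))
```

The order of the two divisions is immaterial, which is why either reliability may be entered first on the diagram.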

We should note carefully just what we are doing when using this
equation. It is an estimate of the correlation between test and criterion
if both could be made perfectly reliable by lengthening the test and the
criterion measure indefinitely. Just because we might get a validity of
.90, for example, by lengthening the test and criterion does not mean
that we have such a validity coefficient with the original test. However,
if the coefficient of validity “corrected for attenuation” is near
unity, it does show that the major problem to work on for better prediction
is the most appropriate means of increasing reliability of test
and criterion measures. If (when corrected for “attenuation”) the
validity coefficient is still in the neighborhood of some reasonably low
value, such as .6 or .7, we can conclude that further work in that particular
field should take two directions. First, it is desirable to try to
improve the reliability of both criterion and test measures. The coefficient
corrected for attenuation shows the maximum validity we can
reasonably hope for by such efforts. If this validity is still a considerable
distance from unity, we can also look for new tests to add to the
prediction battery. If used in this way, to determine what work should
be done next (whether to search for new tests and also to improve the
reliability of tests already in use, or only to improve reliability of tests
now in use), the correction for attenuation is a valuable tool in directing
future research. However, the use of this correction for the sole
purpose of being able to report a higher validity coefficient, accompanied
by the implication that this higher coefficient has already been
achieved, is distinctly misleading and erroneous. When Spearman
(1904a) first presented the correction for attenuation, it was vigorously
criticized by Pearson (1904) and others. An excellent discussion of the
correct and incorrect use of this formula has been presented by Thouless
(1939).

9. Summary 

Several equations have been developed showing the relationship
between test length and validity. The most general equation enables
us to estimate the correlation between two tests, when one is increased
to K and the other to L times its original length. Using $R_{KL}$ to designate
this correlation, we have


(19)    $R_{KL} = \dfrac{r_{1I}}{\sqrt{\dfrac{1}{K} + \left(1 - \dfrac{1}{K}\right) r_{11}}\;\sqrt{\dfrac{1}{L} + \left(1 - \dfrac{1}{L}\right) r_{II}}}$,

where $r_{11}$ is the reliability of the original test,
$r_{II}$ is the reliability of the original criterion measure, and
$r_{1I}$ is the correlation between the two measures (the validity of
the original test).


All the other equations derived in Chapter 9 are special cases of this 
one. 

If one of the tests is increased in length, while the other remains
unaltered, we have the special case of equation 19 in which L = 1. If we
use $R_{KI}$ to designate the new validity, obtained by lengthening only
one of the measures, we have


(5)    $R_{KI} = \dfrac{r_{1I}\sqrt{K}}{\sqrt{1 + (K - 1)\,r_{11}}}$,


where $r_{11}$ designates the reliability of the measure that is to be increased
K times in length. The reliability of the test that remains of unit length
does not enter into the equation.

Equation 5 may also be written to show explicitly the amount of
increase in length necessary for any desired new validity. This equation
is

(9)    $K = \dfrac{R_{KI}^2 (1 - r_{11})}{r_{1I}^2 - r_{11} R_{KI}^2}$.

If the test denoted by subscript 1 becomes infinite in length, whereas
the test denoted by I is unaltered, we have the special case of equation 19,
in which K is $\infty$ and L is 1. The resulting correlation ($R_{\infty I}$) is
given by

(15)    $R_{\infty I} = \dfrac{r_{1I}}{\sqrt{r_{11}}}$.

If the test designated by I is made infinite in length, whereas the test
designated by 1 is unchanged, we have the special case in which K is 1
and L is $\infty$. The corresponding equation is


(17)    $R_{1\infty} = \dfrac{r_{1I}}{\sqrt{r_{II}}}$.


It was also shown in developing equation 13 that the foregoing expressions,
equations 15 and 17, do not change as the length of the test
changes. Thus, for comparing the relative performance of tests that 
vary in length, the validity coefficient divided by the index of reliability 
may be used. 

The correlation between true scores or between measures, each of
which has been made infinitely long (and hence perfectly reliable), is
the special case of equation 19, in which both K and L become infinite.
This correlation ($R_{\infty\infty}$) is given by


(21)    $R_{\infty\infty} = \dfrac{r_{1I}}{\sqrt{r_{11}\,r_{II}}}$    (the correction for attenuation).


Problems 

1. Under what conditions can the validity of a test be equal to its reliability? 

2. Prove that the validity of a test can never be greater than its reliability. What 
assumption was used? 

3. From the equations showing the relationship of test length to validity and to
reliability, determine the relationship between test reliability and validity as the
length of the test is increased while the criterion is unchanged, that is, write

$f(r_{11}, r_{12}, R_{KK}, R_{K2}) = 0$.

4. Fifty mathematics problems in free-answer form are rewritten as multiple-
choice items. If the reliability of the free-answer form is .88 and the reliability of
the multiple-choice form is .93, what correlation would be expected between the
two forms?

5. If the reliability of a test is raised from .80 to .90 by lengthening the test, a
validity coefficient of .60 for this test would be expected to increase to what value?

6.

Test    Mean    Standard     Number      Reliability    Validity Criterion
                Deviation    of Items                   (School Grade Average)

A       16.5       4.4          30           .72               .68
B       12.6       3.5          20           .77               .50
C       53.2      10.7         100           .88               .68
D       32.3       8.8          50           .91               .71
E       66.3      17.2         120           .95               .75

(Criterion reliability = .70)


(a) If test A is lengthened to a 100-item test, what would you expect the new 
mean, standard deviation, reliability, and validity to be? (Assume that the 
criterion test has not been altered.) 

(b) If test B is lengthened to increase its reliability to .90, how many new items 
will be needed? What will the new validity be, assuming that the criterion 
test remains unchanged? 

(c) Which of the five tests is, in its present form, best for use in predicting school
grade average?

(d) Which of the five tests seems to be intrinsically closest to this criterion?

(e) If there is time and material for a 200-item test in each case, which of the 
tests would probably perform best at the new length? 

(f) If the reliability of the criterion is .70, what is the correlation between true
criterion scores and true test scores for each of the tests?

(g) If it were possible to improve the criterion by methods analogous to increasing 
test length, so that the criterion reliability were raised from .70 to .90, what 
would be the new validity of test C in its present form? 

(h) To raise the criterion reliability from .70 to .90 corresponds to an increase in
criterion test length of about how many times?

(i) Give the true variance, and error variance for test C. Estimate the true and 
error variances for test C if it is increased to 300 items. 

(j) If test D is increased to a 150-item test and the criterion test is doubled in 
length, estimate the new reliabilities and validity. 

(k) What will be the validity of test E if its reliability is increased to unity?

7. Test X has a validity coefficient of .65 and reliability .75, whereas the validity
of test Y is .67 and its reliability .95. Each of these tests is a 50-item test. Which
type of item (that in test X or in test Y) would probably show the greater validity
for a 200-item test?

8. Prove that, if a test of n items is a subtest of a test with m items (n < m), the
correlation $r_{nm}$ is



where r is the reliability of a unit test. 

9. Derive the equation for the correlation between true scores in two different
tests. Use only the assumption that “true score plus error score equals gross score,”
that “all correlations with an error score are zero,” and some appropriate definition
of parallel tests.



10 

Effect of Group Heterogeneity 
on Test Reliability 


1. Introduction 

The correlation between two variables is markedly affected by the
range of the variables. For example, if we correlated height and weight
for a group of persons who ranged from 5 feet, 6 inches, to 5 feet, 8 inches
in height, we should find that the correlation is very low, as illustrated
in Figure 1. It is, of course, unlikely that we should make such a peculiar
selection of persons for the purpose of correlating height and weight.
However, the effect would be similar if, for example, we were to correlate
height and weight for pupils in the fifth grade, as compared with correlating
height and weight for pupils in grades one to twelve. The
correlation between mental age and chronological age will be much
greater for a school population than for a given grade.

In a similar manner, restriction of range lowers a reliability coefficient.
If a mental test is given to a random sampling of children aged
six to sixteen, the range of scores will be very great, and the reliability
coefficient will be high. If, on the other hand, the test is given to a
group of eighth-grade students who have a rather narrow mental ability
range, the reliability coefficient will be much lower.

By making certain reasonable assumptions, it is possible to estimate 
the amount of change in reliability that will result from any given change 
in the group variance. Also by solving the equations for variance it is 
possible to estimate the amount of change in variability it would have 
taken to produce any given change in reliability. 

First let us recall that the observed variance of a test has two com¬ 
ponents. It is the sum of the true variance and the error variance. It 
is possible to increase the observed variance by increasing either the 
true variance or the error variance. In all the illustrations given above, 
for example, it was the true variance that changed. That is to say, the 
actual mental ability range of a group of six- to sixteen-year-olds is 
greater than the mental ability range of a group of twelve-year-olds. 

Figure 1. Illustration of effect of changes in group heterogeneity on correlation.

In general, when we give a test to two different groups and find that the
standard deviation of one group is larger than that of the other group,
we are dealing with a case where the true variance of one group is
greater than that of the other group. The only other event that could
account for the difference in group standard deviations would be an
alteration in the error variance. This would mean that the test was
given under good conditions in one group and under poor conditions in
the other group. It is to be noted that ordinary errors in test procedure, 
such as allowing too little time (erroneously calling time 10 minutes too 
soon, for example), would affect everyone’s score in the same direction, 
and would not bring about an increase in the error variance of the test. 
An increase in error variance means that the scores of some persons 
were raised considerably above where the score should have been, 
while for other persons the score was considerably lowered. This sort 
of effect might be produced, for example, if the room was large and the 
students in the first rows received good directions and some help in 
answering the questions, whereas the students in the last rows did not 
understand the directions and did not receive any special help so that 
their scores were lower than they would have been under the standard 







The Theory of Mental Tests [Chap, 10 

conditions for administering the test. Great variability in lighting
conditions in different parts of the testing room might also produce the
same effect. Poor general lighting could not produce such an effect. It
can be seen that, although it is possible to have events that increase
error variance, it is much more likely that a difference in observed
variance of two groups is due to a difference in the true variance of the
persons in those two groups. If the test administered to the two groups is
the same and the same standard conditions are observed, it is highly 
unlikely that the error variance is affected. We shall first show the 
relationship between observed variance and observed reliability for two 
groups, on the usual and reasonable assumption that the differences in 
the two groups are to be attributed to differences in true variance. 


2. Effect of changes in true variance on reliability

Let $s_x$ and $r_{xx}$ designate the standard deviation and reliability for one
group, and $S_X$ and $R_{XX}$ designate the standard deviation and reliability
for the other group. From the definition of the error of measurement
(see equations 42 [Chapter 2], 38 [Chapter 3], and 20 [Chapter 4]), we
may write

(1)    $s_{e_x} = s_x \sqrt{1 - r_{xx}}$

and

(2)    $s_{e_X} = S_X \sqrt{1 - R_{XX}}$.

Since we are assuming that the entire difference in observed variance is 
due to a change in true variance, it follows that the error variance of x 
is equal to the error variance of X. Therefore we may equate equations 
1 and 2 and write 

(3)    $s_x \sqrt{1 - r_{xx}} = S_X \sqrt{1 - R_{XX}}$.


Equation 3 can readily be solved for either $s_x$ or $S_X$ to give an
expression that will show the amount of change in standard deviation of
observed scores required to account for any given change in reliability,
solely on the basis of a group difference in true variance. Solving for
$S_X$, we have

(4)    $S_X = s_x \sqrt{\dfrac{1 - r_{xx}}{1 - R_{XX}}}$,

where $s_x$ is the standard deviation of group x on a given test,
$r_{xx}$ is the test reliability for group x,
$R_{XX}$ is the reliability of the same test for group X, and
$S_X$ is the standard deviation of group X on the same test.
If $r_{xx}$ is the reliability of a test for group x, $R_{XX}$ is the
reliability of the same test for group X, and this difference is
attributed solely to a difference in the true variance of the two
groups, the observed standard deviation of group X ($S_X$) is
given by equation 4.
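
Equation 4 can be evaluated directly. The following is a minimal sketch, not part of the original text; the function name and the numerical values are hypothetical and chosen only for illustration.

```python
import math


def sd_for_reliability(s_x: float, r_xx: float, R_XX: float) -> float:
    """Equation 4: S_X = s_x * sqrt((1 - r_xx) / (1 - R_XX)).

    Assumes the two groups differ only in true variance, so that the
    error of measurement is the same in both.
    """
    if not (0.0 <= r_xx < 1.0 and 0.0 <= R_XX < 1.0):
        raise ValueError("reliabilities must lie in the interval [0, 1)")
    return s_x * math.sqrt((1.0 - r_xx) / (1.0 - R_XX))


# Hypothetical group x: standard deviation 10, reliability .64.
# Standard deviation that group X would need for a reliability of .91:
S_X = sd_for_reliability(10.0, 0.64, 0.91)
print(round(S_X, 4))  # about twice the original standard deviation
```

The guard on the reliabilities reflects the fact that the ratio under the radical is undefined when $R_{XX}$ equals one.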

We may also determine the amount of change in reliability to be
expected from any given change in observed variance, on the assumption
that this change is due solely to a difference in true variance.
Squaring equation 3 and solving for $R_{XX}$ gives

(5)    $R_{XX} = 1 - \dfrac{s_x^2}{S_X^2}(1 - r_{xx})$,

where the terms have the same definitions as in equation 4.

If $s_x^2$ is the variance of group x on a given test, and $S_X^2$ is
the variance of group X on the same test, and if this difference
is attributed solely to a difference in true variance of the
two groups, the reliability of the test for group X ($R_{XX}$) is
given by equation 5.
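
A short sketch of equation 5 follows, under the same assumption of equal error variance in the two groups. The function name and the figures are hypothetical and are not taken from the original text.

```python
def reliability_after_variance_change(r_xx: float, s_x: float, S_X: float) -> float:
    """Equation 5: R_XX = 1 - (s_x**2 / S_X**2) * (1 - r_xx)."""
    if S_X <= 0.0 or s_x <= 0.0:
        raise ValueError("standard deviations must be positive")
    return 1.0 - (s_x ** 2 / S_X ** 2) * (1.0 - r_xx)


# Hypothetical test: reliability .80 in a group with standard deviation 5.
wider = reliability_after_variance_change(0.80, 5.0, 10.0)    # more heterogeneous group
narrower = reliability_after_variance_change(0.80, 5.0, 4.0)  # more homogeneous group
print(round(wider, 4), round(narrower, 4))
```

If the new observed variance is smaller than the error variance $s_x^2(1 - r_{xx})$, the formula returns a negative value, which signals that the assumption of equal error variance cannot hold for those figures.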

Equations 4 and 5, or slight variants of them, have been presented by 
Kelley (1921), (1923c), (1927), by Otis (1922b), Holzinger (1921),
Thurstone (1931a), Peters and Van Voorhis (1940), Crawford and 
Burnham (1946), and others. The only assumption used is that the 
error of measurement is invariant with respect to variations in range of 
ability of the group tested. This assumption was suggested by Kelley 
(1921), Otis (1922b), Nygaard (1923), and others. It was seriously
questioned by Holzinger (1921). It seems reasonable to say that in 
some cases the error of measurement is the same for two groups, whereas 
in other cases the error of measurement for a given test may vary with 
the ability of the group. Equations 4 and 5 may be used if there is 
reason to believe that the error of measurement is about the same for 
the two groups under consideration, and may not be used if there is 
evidence to indicate that the error of measurement is radically different 
for the two groups. 

A different formula based on explicit restriction in a correlated variable 
has been presented by Davis (1944) and Kaitz (1945b).

Wherever possible it is a good idea to test the basic assumption
directly. Compute $s_x \sqrt{1 - r_{xx}}$ and $S_X \sqrt{1 - R_{XX}}$ to see if they are
approximately alike for the two groups. In order to make a precise
judgment about the similarity of two errors of measurement, appropriate
statistical tests of significance of differences would need to be
devised and used. A solution to this problem has been given by Green
(1950b).
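
The direct check described above can be sketched as follows. The standard deviations and reliabilities are hypothetical, and the comparison is purely descriptive; it is not the significance test referred to in the text.

```python
import math


def error_of_measurement(sd: float, reliability: float) -> float:
    """Error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical results for the same test in two groups.
sem_x = error_of_measurement(6.0, 0.75)   # group x
sem_X = error_of_measurement(10.0, 0.91)  # group X

# Relative discrepancy between the two errors of measurement.
relative_gap = abs(sem_x - sem_X) / max(sem_x, sem_X)
print(round(sem_x, 3), round(sem_X, 3), round(relative_gap, 3))
```

When the two values are close, as in this illustration, equations 4 and 5 may reasonably be applied; how close is close enough is a matter for the statistical test, which this sketch does not attempt.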

Otis and Knollin (1921) pointed out that the error of measurement 
was superior to reliability as a test statistic. Kelley (1921) also
recognized some of the disadvantages of the reliability coefficient and the
advantages of the error of measurement. He discussed, and suggested, 
some solutions for the problem of establishing a suitable unit for the 
error of measurement. Basic statistics on any test should include the 
error of measurement as well as the reliability. 

It should be noted that learning can also have the effect of increasing 
or decreasing the test score variance for a given group. In general, if 
we began testing the group when they knew very little, all scores would 
be low, the mean would be low, and the variance would be low. We 
should say that the test was too difficult for this group. As the members 
of the group learned more, the average score and the score variance 
would increase for a time. Then as learning continued beyond this 
stage, we should eventually find that the test was too easy for the group. 
All persons would make perfect or near perfect scores; hence the mean 
score would be high, and the standard deviation of scores would be 
small again. 

Copeland (1934) has pointed out that teaching a class so that the
students begin to approach a perfect score will lower not only the test
variance but also the test reliability. It is also clear that, if the test
were initially too difficult for the group, so that the scores were
uniformly near zero, we should expect the first effect of learning to be an
increase in the test variance, and hence an increase in test reliability.
It must be emphasized, however, that the effects of such changes in test
variance are not related to the discussion presented in this chapter.
There is no reason for believing that this effect is the equivalent of
selecting the members of a group in such a way that the true variance
will be altered and the error variance unaffected. As we approach the
floor or the ceiling of a test, the error variance is clearly affected, but
the theory presented in this chapter has nothing to do with such effects.
The theory presented here is based on the assumption that we are
working entirely within the appropriate range for the test and that no “floor”
or “ceiling” effects are present.

Figure 2 is set up in terms of

$\dfrac{S_X}{s_x} = \sqrt{\dfrac{1 - r_{xx}}{1 - R_{XX}}}$.

From it we can read the proportional change in standard deviation
corresponding to a given change in reliability, or the change in reliability
that corresponds to a given ratio of standard deviations.

Figure 2. Computing diagram for equations 5 and 6: $\dfrac{S_X^2}{s_x^2} = \dfrac{1 - r_{xx}}{1 - R_{XX}}$.

For example, if $r_{xx}$ is .64 and $R_{XX}$ is .91, we locate .91 across the top
of the graph, .64 along the left side, and note that these two lines
intersect on the diagonal line labeled 2.0. This means that a change in
reliability of .64 to .91 would occur if the observed standard deviation
were doubled and the entire increase were due to a change in true
variance. In a similar manner, we observe that an increase of 25 per cent
in the standard deviation, if due entirely to a change in true variance,
will be expected to raise the reliability coefficient from .75 to .84.
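
The two readings from the computing diagram can be checked numerically. This is a sketch only; the function names are hypothetical, and the figures are those quoted in the example.

```python
import math


def sd_ratio(r_xx: float, R_XX: float) -> float:
    """Ratio S_X / s_x implied by two reliabilities (true variance changing)."""
    return math.sqrt((1.0 - r_xx) / (1.0 - R_XX))


def reliability_from_ratio(r_xx: float, ratio: float) -> float:
    """Reliability in group X when S_X / s_x equals `ratio` (equation 5)."""
    return 1.0 - (1.0 - r_xx) / ratio ** 2


# First reading: reliabilities of .64 and .91 correspond to a ratio of 2.0.
print(round(sd_ratio(0.64, 0.91), 4))
# Second reading: a 25 per cent increase in standard deviation raises .75 to .84.
print(round(reliability_from_ratio(0.75, 1.25), 4))
```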
Graphs for computing the change in reliability with a change in group
variance have also been presented by Rulon (1930), Cureton and Dunlap
(1929), and Toops and Edgerton (1927).


3. Effect of changes in error variance on reliability 

Occasionally students are confused by the fact that in the usual
formulas for correlation, the standard deviation appears in the
denominator. They point out that, if the denominator of a fraction is
increased, the fraction is decreased; therefore, the argument runs, “an
increase in standard deviation should be expected to lower the
reliability.” If we take the equation for the true variance (see equation 20 of
Chapter 3),

(7)    $s_t^2 = s_x^2 r_{xx}$,

we may divide through by $s_x^2$ and write the reliability coefficient as

(8)    $r_{xx} = \dfrac{s_t^2}{s_x^2} = \dfrac{s_t^2}{s_t^2 + s_e^2}$.


Thus we see that the only way for the denominator to increase while
the numerator remains constant is for the true variance to remain
constant. This necessarily means that the entire change must be due to an
increase in the error variance. It is true that, if the observed variance
of a test changes, owing solely to an increase in the error variance, the
reliability of the test will decrease. It is possible to derive the equation
showing the relationship between observed standard deviation and
reliability on the assumption that the true variance is constant. No one,
so far as I know, has ever reported a case where such an equation could 
reasonably be used. However, we may derive the equation here simply 
to emphasize the fact that an increase in observed standard deviation 
will have the effect of increasing reliability, if it is due to an increase in 
true variability, and will have the effect of decreasing the reliability if it 
is due to an increase in error variability. 

If the true variance of the test for group x is equal to that for group X,
we may write the two equations for true variance (see equations 39 of
Chapter 2 or 20 of Chapter 3)

(9)    $s_{t_x} = s_x \sqrt{r_{xx}}$ and

(10)    $s_{t_X} = S_X \sqrt{R_{XX}}$.
Since the true variances are in this case assumed to be equal, we may
write

(11)    $s_x \sqrt{r_{xx}} = S_X \sqrt{R_{XX}}$.

Squaring both sides and dividing through by $S_X^2$ gives

(12)    $R_{XX} = r_{xx} \dfrac{s_x^2}{S_X^2}$,

where the terms have the same definition as in equation 4.

If $s_x^2$ is the variance of group x on a given test, and $S_X^2$ is
the variance of group X on the same test, and if this difference
is attributed solely to a difference in error variance
for the two groups, the reliability for group X ($R_{XX}$) is given
by equation 12.

From equation 12, we easily see that, as the variance ($S_X^2$) increases,
the reliability ($R_{XX}$) decreases. The change is very drastic. For
example, if the standard deviation ($S_X$) is double the standard deviation
($s_x$), the reliability ($R_{XX}$) is one-fourth the reliability ($r_{xx}$).

It should again be pointed out that the assumption that a 
change in observed variance is due entirely to a change in 
error variance is very unreasonable. It will not occur except 
with very peculiar and very careless testing methods. 
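
The contrast between the two assumptions can be made concrete with a short sketch. The function names and the figures (reliability .80, standard deviation doubled from 5 to 10) are hypothetical.

```python
def reliability_true_variance_changed(r_xx: float, s_x: float, S_X: float) -> float:
    """Equation 5: the change in observed variance is all true variance."""
    return 1.0 - (s_x ** 2 / S_X ** 2) * (1.0 - r_xx)


def reliability_error_variance_changed(r_xx: float, s_x: float, S_X: float) -> float:
    """Equation 12: the change in observed variance is all error variance."""
    return r_xx * s_x ** 2 / S_X ** 2


# Same doubling of the observed standard deviation under each assumption.
by_true = reliability_true_variance_changed(0.80, 5.0, 10.0)
by_error = reliability_error_variance_changed(0.80, 5.0, 10.0)
print(round(by_true, 4), round(by_error, 4))
```

Under the first assumption the reliability rises; under the second it falls to one-fourth of its former value, as stated after equation 12.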

4. Conditions under which the error of measurement is 
invariant with respect to test score 

The derivations presented in this chapter have depended in general 
on the assumption that the error of measurement is constant for a given 
test regardless of the variability of the group to which the test is given. 
This assumption will be true in general only if the error of measurement 
is the same regardless of the magnitude of the test score. Does the 
error of measurement change in some systematic fashion as the
magnitude of the test score changes? To study this problem analytically,
we may proceed by expressing the error of measurement as a function 
of test score. The solution presented in this section is the one given by 
Mollenkopf (1948), (1949). 

Let us regard the score for individual i as made up of two equivalent
parts, for example,

(13)    $x_i = x_{i1} + x_{i2}$.

The subscripts 1 and 2 designate equivalent halves which are such that

(14)    $s_1 = s_2$,

(15)    $\alpha' = \alpha''$,

and

(16)    $\beta' = \beta''$,

where

(17)    $\alpha' = \dfrac{\sum_{i=1}^{N} x_{i1}^3}{N s_1^3}$,    $\alpha'' = \dfrac{\sum_{i=1}^{N} x_{i2}^3}{N s_2^3}$,

and

(18)    $\beta' = \dfrac{\sum_{i=1}^{N} x_{i1}^4}{N s_1^4}$,    $\beta'' = \dfrac{\sum_{i=1}^{N} x_{i2}^4}{N s_2^4}$.

Using the relationship shown in equation 10, Chapter 6, we may write
the variance of the total test ($S_x^2$) as follows:

(19)    $S_x^2 = 2 s_1^2 (1 + r)$,

where

$r = \dfrac{\sum_{i=1}^{N} x_{i1} x_{i2}}{N s_1 s_2}$.

Chap. 10] Group Heterogeneity, Effect on Test Reliability 117 


From equation 30, Chapter 6, the reliability of the total test (R) may
be written as

(20)    $R = \dfrac{2r}{1 + r}$.

From equation 37, Chapter 3, the error of measurement ($S_e$) may be
written as

(21)    $S_e^2 = S_x^2 (1 - R)$.

To express $S_e$ in terms of $s_1$ and r, we substitute equations 19 and 20 in
equation 21 and simplify, obtaining

(22)    $S_e^2 = 2 s_1^2 (1 - r)$.
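
The simplification leading to equation 22 is brief and may be written out in full. The steps below use only equations 19, 20, and 21.

```latex
\begin{align*}
S_e^2 &= S_x^2 (1 - R)
        && \text{(equation 21)} \\
      &= 2 s_1^2 (1 + r)\left(1 - \frac{2r}{1 + r}\right)
        && \text{(equations 19 and 20)} \\
      &= 2 s_1^2 (1 + r)\,\frac{1 - r}{1 + r} \\
      &= 2 s_1^2 (1 - r).
\end{align*}
```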


Using equation 14, we may write the variance of the differences
($x_{i1} - x_{i2}$) as

(23)    $\dfrac{\sum_{i=1}^{N} (x_{i1} - x_{i2})^2}{N} = 2 s_1^2 (1 - r)$.

Thus we see that

(24)    $S_e^2 = \dfrac{\sum_{i=1}^{N} (x_{i1} - x_{i2})^2}{N}$.


The square of the error of measurement for the total test is equal to
the variance of the differences between parallel halves. Thus we
may regard the squared difference $(x_{i1} - x_{i2})^2$ as the “error” for
individual i, and the mean of these errors as the squared error of
measurement for the test. Let us define this error for individual i by

(25)    $y_i = (x_{i1} - x_{i2})^2$.
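
The agreement between equations 21 and 24 can be checked on a small set of scores. The data below are hypothetical deviation scores, chosen so that the two halves have zero means and equal variances; they are not taken from the original text.

```python
import math

# Hypothetical deviation scores on the two halves (means zero, equal variances).
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-1.0, -2.0, 0.0, 2.0, 1.0]
n = len(x1)

s1_sq = sum(v * v for v in x1) / n
s2_sq = sum(v * v for v in x2) / n
r = sum(a * b for a, b in zip(x1, x2)) / (n * math.sqrt(s1_sq * s2_sq))

total = [a + b for a, b in zip(x1, x2)]   # equation 13
S_x_sq = sum(v * v for v in total) / n    # variance of the total score
R = 2.0 * r / (1.0 + r)                   # equation 20

se_sq_from_reliability = S_x_sq * (1.0 - R)   # equation 21
y = [(a - b) ** 2 for a, b in zip(x1, x2)]    # equation 25
se_sq_from_differences = sum(y) / n           # equation 24

print(round(se_sq_from_reliability, 6), round(se_sq_from_differences, 6))
```

The two values agree because the halves were constructed to have equal standard deviations; with unequal halves the identity would hold only approximately.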

Let us now approximate $y_i$ as closely as possible by a second-degree
function of $x_i$, designated

(26)    $\hat{y}_i = a x_i^2 + b x_i + c$,

where a, b, and c are chosen so that

(27)    $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - a x_i^2 - b x_i - c)^2$

is a minimum.
To minimize this expression, differentiate successively with respect to
a, b, and c, and set the derivatives equal to zero. This procedure gives
the equations

(28)    $a \sum x_i^4 + b \sum x_i^3 + c \sum x_i^2 = \sum x_i^2 y_i$,

        $a \sum x_i^3 + b \sum x_i^2 + c \sum x_i = \sum x_i y_i$,

        $a \sum x_i^2 + b \sum x_i + cN = \sum y_i$,

where each sum runs from i = 1 to N.

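The solution for a, b, and c in determinantal form is

(29)    $a = \dfrac{1}{D} \begin{vmatrix} \sum x^2 y & \sum x^3 & \sum x^2 \\ \sum xy & \sum x^2 & \sum x \\ \sum y & \sum x & N \end{vmatrix}$,

        $b = \dfrac{1}{D} \begin{vmatrix} \sum x^4 & \sum x^2 y & \sum x^2 \\ \sum x^3 & \sum xy & \sum x \\ \sum x^2 & \sum y & N \end{vmatrix}$,

        $c = \dfrac{1}{D} \begin{vmatrix} \sum x^4 & \sum x^3 & \sum x^2 y \\ \sum x^3 & \sum x^2 & \sum xy \\ \sum x^2 & \sum x & \sum y \end{vmatrix}$,

where D denotes the determinant that appears as the denominator of all three expressions,

        $D = \begin{vmatrix} \sum x^4 & \sum x^3 & \sum x^2 \\ \sum x^3 & \sum x^2 & \sum x \\ \sum x^2 & \sum x & N \end{vmatrix}$.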
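
Equation 29 is Cramer's rule applied to the three normal equations of equation 28. A minimal sketch in plain Python follows; the helper names `det3` and `fit_quadratic` and the test data are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_quadratic(xs, ys):
    """Least-squares a, b, c for y = a*x**2 + b*x + c, as in equation 29."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x ** 2 * y for x, y in zip(xs, ys))

    denom = det3([[sx4, sx3, sx2], [sx3, sx2, sx], [sx2, sx, n]])
    if denom == 0:
        raise ValueError("normal equations are singular; need 3 distinct x values")
    a = det3([[sx2y, sx3, sx2], [sxy, sx2, sx], [sy, sx, n]]) / denom
    b = det3([[sx4, sx2y, sx2], [sx3, sxy, sx], [sx2, sy, n]]) / denom
    c = det3([[sx4, sx3, sx2y], [sx3, sx2, sxy], [sx2, sx, sy]]) / denom
    return a, b, c


# Points lying exactly on y = 2x^2 + 3x + 1, so the fit should recover 2, 3, 1.
xs = [-2, -1, 0, 1, 2]
ys = [2 * x ** 2 + 3 * x + 1 for x in xs]
print(fit_quadratic(xs, ys))
```

For larger problems a library least-squares routine would be numerically safer; the determinantal form is used here only because it mirrors the text.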
The problem now is to express a, b, and c in terms of the moments of
the x-score distribution. The terms $\sum x$, $\sum x^2$, $\sum x^3$, $\sum x^4$, and $\sum y$ can be
expressed readily as follows:

(30)    $\sum x = 0$,

(31)    $\sum x^2 = N S_x^2$,

(32)    $\sum x^3 = N S_x^3 \alpha_x$,

(33)    $\sum x^4 = N S_x^4 \beta_x$,

and

(34)    $\sum y = N S_x^2 (1 - R)$,

where

(35)    $\alpha_x = \dfrac{\sum_{i=1}^{N} x_i^3}{N S_x^3}$,

(36)    $\beta_x = \dfrac{\sum_{i=1}^{N} x_i^4}{N S_x^4}$.

This completes the solution, except for the expressions $\sum xy$ and $\sum x^2 y$.
We must now find expressions for these terms as functions of the
moments of the x-score distribution.

In order to do this, it is necessary first to find the value of the third-
and fourth-degree product moments, $\sum x_1^2 x_2$ (or $\sum x_1 x_2^2$), $\sum x_1^3 x_2$ (or
$\sum x_1 x_2^3$), and $\sum x_1^2 x_2^2$.

To simplify $\sum x_1^2 x_2$, we proceed as follows. We consider the
regression of $x_2$ on $x_1$, and write $x_2$ as the sum of the score predicted from $x_1$
and the residual error designated $e_2$. This gives

(37)    $x_2 = r x_1 + e_2$.

Substituting this value for $x_2$ in $\sum x_1^2 x_2$, and noting that $s_1 = s_2$, we have

(38)    $\sum x_1^2 x_2 = r \sum x_1^3 + \sum e_2 x_1^2$.

Let us assume that

(39)    $r_{e_2 x_1^2} = 0$.
From the gross score formula for correlation,

$r_{xy} = \dfrac{\sum XY - N \bar{X} \bar{Y}}{\sqrt{(\sum X^2 - N \bar{X}^2)(\sum Y^2 - N \bar{Y}^2)}}$,

we note that if $r_{xy} = 0$, then $\sum XY = N \bar{X} \bar{Y}$. Thus, if $r_{e_2 x_1^2} = 0$,

$\sum e_2 x_1^2 = \dfrac{\sum e_2 \sum x_1^2}{N}$.

Since $\sum e_2$ is zero, it follows that

(40)    $\sum e_2 x_1^2 = 0$.

Substituting equations 17 and 40 in equation 38, we have

(41)    $\sum x_1^2 x_2 = N r s_1^3 \alpha'$.

To evaluate $\sum x_1 x_2^2$, we assume that

(42)    $r_{e_1 x_2^2} = 0$,

and by a corresponding procedure find that

(43)    $\sum x_1 x_2^2 = N r s_1^3 \alpha'$.

To simplify $\sum x_1^3 x_2$, we proceed by substituting for $x_2$ from equation
37, noting that $s_1 = s_2$, and writing

(44)    $\sum x_1^3 x_2 = r \sum x_1^4 + \sum e_2 x_1^3$.

As before, if we assume that

(45)    $r_{e_2 x_1^3} = 0$,

it follows that

$\sum e_2 x_1^3 = \dfrac{\sum e_2 \sum x_1^3}{N}$.

Since $\sum e_2$ is zero, it follows that

(46)    $\sum e_2 x_1^3 = 0$;

hence, substituting equations 18 and 46 in equation 44, we have

(47)    $\sum x_1^3 x_2 = N r s_1^4 \beta'$.

In like manner, by assuming that

(48)    $r_{e_1 x_2^3} = 0$,

we may show that

(49)    $\sum x_1 x_2^3 = N r s_1^4 \beta'$.
To simplify $\sum x_1^2 x_2^2$, we again use equation 37, note that $s_2 / s_1 = 1$,
and write

(50)    $\sum x_1^2 x_2^2 = \sum x_1^2 (r x_1 + e_2)^2$.

Expanding and taking the constants outside the summation, we have

(51)    $\sum x_1^2 x_2^2 = r^2 \sum x_1^4 + \sum x_1^2 e_2^2 + 2r \sum x_1^3 e_2$.

If we assume that

(52)    $r_{x_1^2 e_2^2} = 0$,

then

$\sum x_1^2 e_2^2 = N \left[\dfrac{\sum x_1^2}{N}\right] \left[\dfrac{\sum e_2^2}{N}\right]$.

We see that the first term in brackets is the variance of $x_1$ and the
second is the variance of the error made in estimating $x_2$ from $x_1$. Thus
we may write

(53)    $\sum x_1^2 e_2^2 = N s_1^4 (1 - r^2)$.

Substituting equations 18, 46, and 53 in equation 51, noting that $s_1 = s_2$,
and simplifying, we have

(54)    $\sum x_1^2 x_2^2 = N s_1^4 (r^2 \beta' + 1 - r^2)$.

Let us now write the skewness index for the half test ($\alpha'$) as a
function of that for the total test ($\alpha_x$). From equation 13,

(55)    $\sum x^3 = \sum (x_1 + x_2)^3$.

Using equation 35 and expanding equation 55, we have

(56)    $N S_x^3 \alpha_x = \sum x_1^3 + 3 \sum x_1^2 x_2 + 3 \sum x_1 x_2^2 + \sum x_2^3$.

Substituting equations 15, 17, 41, and 43 in equation 56 and simplifying,
we have

(57)    $S_x^3 \alpha_x = 2 s_1^3 \alpha' (1 + 3r)$.

Solving equation 20 explicitly for r, we have

(58)    $r = \dfrac{R}{2 - R}$.

Solving equation 19 for $s_1$ and substituting from equation 58, we have

(59)    $s_1^2 = \dfrac{S_x^2 (2 - R)}{4}$.
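
The algebra behind equations 58 and 59 is short and is sketched below, using only equations 19 and 20.

```latex
\begin{align*}
R = \frac{2r}{1 + r}
  \;&\Longrightarrow\; R + Rr = 2r
  \;\Longrightarrow\; r\,(2 - R) = R
  \;\Longrightarrow\; r = \frac{R}{2 - R}, \\[4pt]
1 + r &= 1 + \frac{R}{2 - R} = \frac{2}{2 - R}, \\[4pt]
s_1^2 &= \frac{S_x^2}{2\,(1 + r)} = \frac{S_x^2\,(2 - R)}{4}.
\end{align*}
```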
Substituting equations 58 and 59 in equation 57 and simplifying, we
have

(60)    $\alpha_x = \alpha' \dfrac{(1 + R) \sqrt{2 - R}}{2}$,

from which we may write

(61)    $\alpha' = \dfrac{2 \alpha_x}{(1 + R) \sqrt{2 - R}}$.

To write the kurtosis index for the half test ($\beta'$) as a function of the
corresponding index for the total test ($\beta_x$), we use equation 13, and write

(62)    $\sum x^4 = \sum (x_1 + x_2)^4$.

Using equation 36 and expanding equation 62, we have

(63)    $N S_x^4 \beta_x = \sum x_1^4 + 4 \sum x_1^3 x_2 + 6 \sum x_1^2 x_2^2 + 4 \sum x_1 x_2^3 + \sum x_2^4$.

Substituting equations 16, 18, 47, 49, and 54 in equation 63 and
simplifying, we have

(64)    $S_x^4 \beta_x = 2 s_1^4 [\beta' (1 + 4r + 3r^2) + 3(1 - r^2)]$.

Substituting equations 58 and 59 in equation 64 and simplifying, we have

(65)    $\beta_x = \beta' \left(\dfrac{1 + R}{2}\right) + \dfrac{3(1 - R)}{2}$.

Solving explicitly for $\beta'$ gives

(66)    $\beta' = \beta_x \left(\dfrac{2}{1 + R}\right) - \dfrac{3(1 - R)}{1 + R}$.

We now have all the equations needed to solve for $\sum xy$ and $\sum x^2 y$ so
that equation 29 can be expressed entirely in terms of moments of the
total score distribution and the reliability of the test. Multiplying
equation 13 by equation 25, we have

(67)    $\sum xy = \sum (x_1 + x_2)(x_1 - x_2)^2$,

which expands to give

(68)    $\sum xy = \sum x_1^3 + \sum x_2^3 - \sum x_1^2 x_2 - \sum x_1 x_2^2$.

Substituting equations 15, 17, 41, and 43 in equation 68 and
simplifying gives

(69)    $\sum xy = 2 N s_1^3 \alpha' (1 - r)$.
Substituting equations 58, 59, and 61 in equation 69 and simplifying,
we have

(70)    $\sum xy = N S_x^3 \alpha_x \left(\dfrac{1 - R}{1 + R}\right)$.

To express $\sum x^2 y$ in terms of moments of the total score distribution,
we use equations 13 and 25 to give

(71)    $\sum x^2 y = \sum (x_1 + x_2)^2 (x_1 - x_2)^2$.

Expanding, we have

(72)    $\sum x^2 y = \sum x_1^4 + \sum x_2^4 - 2 \sum x_1^2 x_2^2$.

Substituting equations 16, 18, and 54 in equation 72 and simplifying,
we have

(73)    $\sum x^2 y = 2 N s_1^4 (\beta' - 1)(1 - r^2)$.

Substituting equations 58, 59, and 66 in equation 73 and simplifying,
we have

(74)    $\sum x^2 y = N S_x^4 \left(\dfrac{1 - R}{1 + R}\right) (\beta_x - 2 + R)$.


Substituting equations 70, 74, and 30 to 34 in equations 29 and
simplifying, we have

(75)    $a = \dfrac{(1 - R)(\beta_x - 3 - \alpha_x^2)}{(1 + R)(\beta_x - 1 - \alpha_x^2)}$,

        $b = \dfrac{(1 - R)\, 2 S_x \alpha_x}{(1 + R)(\beta_x - 1 - \alpha_x^2)}$,

        $c = \dfrac{(1 - R) S_x^2 (\beta_x R - \alpha_x^2 R + 2 - R)}{(1 + R)(\beta_x - 1 - \alpha_x^2)}$.

Using these values for a, b, and c in equation 26, we have

(76)    $\hat{y} = \dfrac{1 - R}{(1 + R)(\beta_x - 1 - \alpha_x^2)} [(\beta_x - 3 - \alpha_x^2) x^2 + 2 S_x \alpha_x x + (\beta_x R - \alpha_x^2 R + 2 - R) S_x^2]$.

When $\alpha_x$ is zero, that is, in a symmetrical distribution, we have

(77)    $\hat{y} = \dfrac{1 - R}{(1 + R)(\beta_x - 1)} \{(\beta_x - 3) x^2 + (\beta_x R + 2 - R) S_x^2\}$.
When kurtosis is equal to that of the normal curve, that is, $\beta_x = 3$, we
have

(78)    $\hat{y} = \dfrac{1 - R}{(1 + R)(2 - \alpha_x^2)} \{-\alpha_x^2 x^2 + 2 S_x \alpha_x x + (2R - \alpha_x^2 R + 2) S_x^2\}$.

For a symmetrical ($\alpha_x = 0$), mesokurtic ($\beta_x = 3$) distribution, we have

(79)    $\hat{y} = S_x^2 (1 - R)$.

In this case the error of measurement is constant as test score varies.

We see from these equations that for the case of zero skewness and
kurtosis of 3, the average error of measurement is the same regardless
of the score. However, for distributions that are positively or negatively
skewed, or for a kurtosis greater or less than three, we should expect
the error of measurement to vary with the magnitude of the test score.
In addition to presenting the theoretical derivation given above,
Mollenkopf (1948), (1949) has presented empirical verification to show that
the error of measurement does vary in general in accordance with the
indications of equation 76.
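
Equation 76 can be written as a small function to see how the fitted squared error behaves across the score range. This is a sketch under the assumptions of the derivation; the function name and the numerical values are hypothetical, and x is taken as a deviation score, in keeping with equation 30.

```python
def squared_error_at_score(x: float, R: float, S_x: float,
                           alpha_x: float, beta_x: float) -> float:
    """Equation 76: fitted squared error of measurement at deviation score x."""
    denom = (1.0 + R) * (beta_x - 1.0 - alpha_x ** 2)
    if denom == 0.0:
        raise ValueError("beta_x - 1 - alpha_x**2 must not be zero")
    bracket = ((beta_x - 3.0 - alpha_x ** 2) * x ** 2
               + 2.0 * S_x * alpha_x * x
               + (beta_x * R - alpha_x ** 2 * R + 2.0 - R) * S_x ** 2)
    return (1.0 - R) * bracket / denom


R, S_x = 0.90, 10.0

# Symmetrical, mesokurtic case (equation 79): constant at S_x**2 * (1 - R).
flat = [squared_error_at_score(x, R, S_x, 0.0, 3.0) for x in (-20.0, 0.0, 20.0)]

# Symmetrical, platykurtic case (beta_x = 2): largest at the mean score.
centre = squared_error_at_score(0.0, R, S_x, 0.0, 2.0)
edge = squared_error_at_score(10.0, R, S_x, 0.0, 2.0)
print([round(v, 4) for v in flat], round(centre, 4), round(edge, 4))
```

In the platykurtic illustration the coefficient of $x^2$ is negative, so the fitted curve has its maximum at the mean, which agrees with the statement that the error of measurement has a maximum for a platykurtic distribution.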


5. Summary 

The effect of group heterogeneity on test reliability has been derived 
on the assumption that the error variance is the same for the two groups, 
the entire difference in observed variance being attributed to a difference 
in true variance of the two groups. 

Solving explicitly for one of the standard deviations, we have

(4)    $S_X = s_x \sqrt{\dfrac{1 - r_{xx}}{1 - R_{XX}}}$.

Solving explicitly for one of the reliability coefficients gives

(5)    $R_{XX} = 1 - \dfrac{s_x^2}{S_X^2}(1 - r_{xx})$.

In these equations $s_x^2$ is the variance of group x on a given test,
$r_{xx}$ is the reliability of the same test for group x,
$S_X^2$ is the variance of group X on the same test, and
$R_{XX}$ is the reliability of the same test for group X.


A computing diagram for these equations is shown in Figure 2. 
Mollenkopf (1948), (1949) has shown that, if we assume the test has
been divided into two parallel halves so that

(14)    $s_1 = s_2$,

(15)    $\alpha' = \alpha''$,

(16)    $\beta' = \beta''$,

and if the errors of estimate are unrelated to the independent variable
so that

(39)    $r_{e_2 x_1^2} = 0$,

(42)    $r_{e_1 x_2^2} = 0$,

(45)    $r_{e_2 x_1^3} = 0$,

(48)    $r_{e_1 x_2^3} = 0$,

(52)    $r_{x_1^2 e_2^2} = 0$,

then the best fitting quadratic to express the error of measurement as
a function of test score is

(76)    $\hat{y} = \dfrac{1 - R}{(1 + R)(\beta_x - 1 - \alpha_x^2)} [(\beta_x - 3 - \alpha_x^2) x^2 + 2 S_x \alpha_x x + (\beta_x R - \alpha_x^2 R + 2 - R) S_x^2]$,

where R is the corrected parallel halves reliability,
$S_x$ is the standard deviation of the distribution of test scores,
$\alpha_x$ is the skewness of the distribution of test scores, and
$\beta_x$ is the kurtosis of the distribution of test scores.

According to this derivation the error of measurement is constant
with respect to variation in test score if and only if the test score
distribution has a skewness of zero and a kurtosis of three. The error of
measurement has a minimum for a leptokurtic and a maximum for a
platykurtic distribution of test scores.

It should be noted particularly that equation 76 follows from the 
assumption that the two halves used for computing reliability were 
parallel halves (that is, from equations 14, 15, and 16, and from the 
assumptions of equations 39, 42, 45, 48, and 52). It also should be 
noted that, if the conclusions of equation 76 do not apply in any given 
case, it must follow that one or more of the foregoing eight assumptions 
do not hold for that case. That is, either the halves used for computing 
reliability were not parallel, the errors of estimate correlated with the 
squares or cubes of the independent variable, or the squares of the
errors of estimate correlated with the squares of the independent 
variable. 

According to Mollenkopf's derivation, if the distribution of test scores
has zero skewness and kurtosis of three, the error of measurement is 
invariant with respect to changes in magnitude of test score. The 
error of measurement has a minimum for a leptokurtic and a maximum 
for a platykurtic distribution of test scores. 


Problems 

1. Assume a set of test scores each of which has been divided into comparable
halves for purposes of obtaining a split-half reliability. Designate these halves by
$x_a$ and $x_b$, and let $d = x_a - x_b$ (the difference between a person’s score on part a and part b).
The total score on the test is the sum of the halves ($s = x_a + x_b$). Assume that the
halves are comparable so that their means and standard deviations are identical.

(a) Write $r_{x_a x_b}$ in terms of the standard deviations of s and d.

(b) Write the reliability of the total test in terms of the standard deviations of s
and d.

(c) Express the variance of d in terms of the reliability coefficient and the variance
of test scores.

(d) Assume that selection of cases occurs by rejecting persons with high or low
total scores, which would have the effect of changing the variance of s without
altering the variance of d. Write the new reliability coefficient in terms of the
old reliability coefficient and the two total score variances.

(e) Show that the standard deviation of d is the error of measurement of the test.

2. In one study of tests A, B, C, D, and E, the following results are obtained:

Test    Mean    Standard     Number      Reliability
                Deviation    of Items
A       18.4      4.2          30           .72
B       28.9      9.8          60           .96
C       37.2      8.1          50           .90
D       63.7     10.4         100           .86
E       39.2     11.5          75           .92

(a) Another investigator reports administering test A to a new group and finding
a mean of 25.3 and a standard deviation of 8.4. About what reliability would
you expect the test to have for this new group?

(b) It is reported that test B has been administered to a new group and the
reliability coefficient is only .90. What would account satisfactorily for this
lowered reliability without indicating any faults of test administration or
scoring?

(c) Test C is administered to a new group with the following results: mean 31.9,
standard deviation 12.7, reliability .96. Are these results in reasonable
agreement with those reported in the table for test C?

(d) Test D is reported as having a mean of 68.2, a standard deviation of 14.6, and
a reliability of .98. Are these results in reasonable agreement with those
reported in the table for test D?

(e) If the report on test D also stated that the reliability was based on a corrected
odd-even correlation and that the time allowance for the test had been changed,
would you infer that the time allowance had been increased or decreased?
(A brief survey of the chapters on experimental methods of determining
reliability and on speed versus power tests may help answer this question.)

(f) A teacher wishes to use test E for sectioning a class, and finds a mean score of
45.3 and a standard deviation of 3.9. What comment would you make on this
proposal?

3. Study the equation for estimating a change in reliability due to a change in 
group variance given by Dickey (1934). Comment on this equation. 

4. Write Davis' (1944) equations for the special case in which the “restricting
variable” is “true score.”



11 

Effect of Group Heterogeneity 
on Validity (Bivariate Case) 


1. Illustrations of selection 

In addition to affecting the reliability coefficient of a test, the
heterogeneity of the group tested will also affect the validity coefficient. For
example, if in Figure 1 the abscissa represents a test and the ordinate a 
criterion, the validity coefficient for the total group will be much greater 
than the validity coefficient for the restricted portion of the group 
included between the two dotted lines. The validity coefficient would 
be lowered in a similar manner if the selection were made upon the basis 
of the criterion variable. 

It should be noted that here again we are assuming that the change in 
variability is due mainly to a change in true variance. The actual
persons at the upper or the lower end of the scale are removed, which
means that the true variance is lowered. In this section we shall not 
consider the case in which there are changes in observed variance due 
to changes in error variance. As pointed out in Chapter 10, such an 
assumption is quite unreasonable. 

In considering the effect of selection of cases upon the intercorrelation 
between two tests it is important to note that this effect will vary with 
the nature of the selection procedure. In any practical situation the 
actual selection procedures are usually complex, and to a great degree 
unknown. We can only investigate the situation and make the most 
reasonable guesses possible regarding the selection procedure operative 
in any particular instance. 

For example, if an intelligence test is given to all applicants for
admission to a college, and only those with a score greater than 0.5 sigma are
accepted, we have a clear case of selection on the basis of the test.
Similarly, if a business concern uses a selection test and accepts only the
upper 60 per cent, we have a clear-cut case of selection on the basis of
the test. Usually such clear cases do not occur. The college admits all
students with a score over 0.5 sigma, provided they do not have poor
grades or a bad recommendation from their high school principal.
Likewise, the college may reject all applicants with a score of less than
0.5 sigma unless they have an exceptionally good high school record
and excellent recommendations. In most if not all practical situations
it is impossible to find out just what weighted combination of the
available variables was used for selection. In many cases, however, it is
clear that a given selection test was one of the major items in the
selection procedure so that the results found by assuming that selection was
solely on the basis of the test will not be far from the correct estimate.

Figure 1. A diagrammatic illustration of the change in correlation with a change
in standard deviation.

There may also be times when the criterion itself is the selective
device. For example, if we are using a given test to predict college
grades, it may be that students with a grade less than C or less than D
are dropped. Then selection is clearly on the basis of the criterion.
Likewise if a manufacturing concern wishes to develop a test that will
predict a given production record, and dismisses employees if the
production record falls below a given minimum standard, we have a clear
case of selection on the basis of the criterion.

In any practical situation there are usually several selection devices
at work. This fact may suggest that, if we considered the case of
selection on the basis of two or more variables (multivariate selection),
we should have equations that would be more applicable to practical
situations.¹ Careful investigation of selection procedures shows, however,
that there are numerous extenuating circumstances that sometimes
override the strict selection rules. In other words, the equations given
to “correct for selection,” whether univariate or multivariate selection,
are only approximations to the practical situation. The equation that
will give the closest approximation to the selection facts of a particular
situation is the one to use.

In short, there is no completely satisfactory substitute for a well
set-up experiment, in which a number of selection tests are given to a
group. All of the group without any selection are then admitted to the
training program and given the criterion measure under identical
conditions. If these conditions are observed, the necessity of “correcting
for homogeneity” is completely avoided, and we can make very simple and
straightforward comparisons of the relative merit of different selection
procedures. However, many practical situations arise in which it is
not possible to have such complete control over the experimental
conditions. If we are dealing with such situations and wish, for example,
to compare the validities of two tests, one of which was used for
selection and the other was given after selection, the simple zero-order
validity coefficients are definitely misleading. It is necessary either to
make a correction for the results of selection, taking care to select the
equations that are most nearly appropriate for the case in hand, or to use
some index, such as the error of estimate, that is not affected by the
selection procedure.

In the material that follows we shall consider only the case of
univariate selection. We shall consider first the bivariate case, in which
we are interested in the correlation between two variables (X and Y)
and selection has been on the basis of one of these variables. Next we
shall consider the trivariate case, where we are interested in the
correlation between Y and Z, when selection has been on the basis of a
third variable (X).

1 Multivariate selection is discussed in Chapter 13.

2. The distinction between explicit and incidental selection

Considerable confusion and error have been caused by the failure to
distinguish carefully between two types of selection. We have first the
direct selection of cases on the basis of a given variable. Those who are
above the critical score are admitted, and those below the critical score
are rejected. This selection is referred to as explicit selection. Second,
we have an indirect selection effect on one variable brought about by
explicit selection on another correlated variable. For example, if a
given college rejects all applicants who are below the 50th percentile
score on the ACE Psychological Examination, the result will be the
selection of a group of persons who would score high on the Ohio
Psychological Examination, even though the Ohio Psychological is never
given to that group. This occurs because scores on the two examinations
are positively correlated. We shall refer to the selection effect on the
Ohio Psychological in this case as incidental selection. Explicit
selection on a given examination results in incidental selection on all
tests correlated with that examination.

In order to avoid the confusion that has appeared in much of the
literature on correction for the effects of selection, we shall treat the two
cases separately. First, however, we shall consider the basic
assumptions common to both types of selection for the bivariate case.


3. Basic assumptions for the bivariate case 

Figure 1 illustrates this case. Our problem is to find what parameters
are invariant from the curtailed to the extended distribution. These
parameters can then be used to bridge the gap from one distribution to
the other. If selection is on the basis of the x variable, the regression
line of y on x will not be systematically affected, and can be assumed to
be the same for the curtailed and extended distribution. In Figure 1
we see that the mean y for a given x is not altered by explicit selection
on x. Since the regression of y on x is the line through these means, we
see that, if the regression is perfectly linear, the assumption will hold
exactly. Also, from inspecting the diagram, we see that explicit
selection on x will markedly alter the mean x for a given y and hence will
alter the regression of x on y. If we designate the curtailed group by x
and y and the extended group by X and Y, the foregoing assumption may be
written by putting down the equation of the regression line of y on x
for each group:

(1)   $y = r_{xy}\,\dfrac{s_y}{s_x}\,x,$

(2)   $Y = R_{XY}\,\dfrac{S_Y}{S_X}\,X.$


Since it is assumed that the predicted or average y (Y) for a given x (X)
is the same in both cases, the slopes of the two regression lines are equal,
and we may write

(3)   $r_{xy}\,\dfrac{s_y}{s_x} = R_{XY}\,\dfrac{S_Y}{S_X}.$

From inspection of Figure 1, we see that the dispersion about the
regression line of y (Y) on x (X) will not be affected. That is to say, not
only is the mean y for a given x the same, but the dispersion of y's for a
given x is the same in both groups.¹ A little geometric consideration
will show that, if the selection is explicitly on x as shown in the diagram,
the dispersion of the x's for a given y cannot remain the same for all
values of y. In fact for each and every value of y (Y) in the middle
range, the dispersion of x for the curtailed group is much less than the
dispersion for the complete group. From the foregoing considerations,
we see that, when there is explicit selection on x, the error made in
estimating y from x is the same for both the complete and the curtailed
group. We may thus write the expressions for the two errors of estimate
and set them equal to each other as follows:


(4)   $s_{y \cdot x} = s_y\sqrt{1 - r_{xy}^2},$

(5)   $S_{Y \cdot X} = S_Y\sqrt{1 - R_{XY}^2},$

(6)   $s_y\sqrt{1 - r_{xy}^2} = S_Y\sqrt{1 - R_{XY}^2}.$
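The two assumptions of this section (equal slopes, equation 3, and equal
errors of estimate, equation 6) can be illustrated with a small simulation.
The sketch below is not part of the original derivation. It assumes a
bivariate normal population with correlation .6 and explicit selection of
all cases with x above 0.5; the function names, the sample size, and the
seed are hypothetical choices made only for illustration.

```python
import math
import random


def simulate(n=20000, rho=0.6, cut=0.5, seed=1):
    """Draw (x, y) pairs with population correlation rho and return both the
    complete group and the group left after explicit selection x > cut."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    selected = [p for p in pairs if p[0] > cut]
    return pairs, selected


def summarize(pairs):
    """Return the slope of y on x, the error of estimate, and r for a group."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / n
    syy = sum((y - my) ** 2 for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / n
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx                      # equals r * s_y / s_x
    err = math.sqrt(syy * (1.0 - r ** 2))  # equals s_y * sqrt(1 - r^2)
    return {"slope": slope, "error_of_estimate": err, "r": r}


complete, curtailed = simulate()
complete_stats = summarize(complete)
curtailed_stats = summarize(curtailed)
```

Under these assumptions the slope and the error of estimate should agree
closely in the two groups, while the correlation in the curtailed group
should be clearly lower than in the complete group.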


4. Variance known for both groups on variable subject to 
incidental selection 

In the usual case of correction for restriction of range, we have
complete information on one group (usually the curtailed group) and we
have, or can estimate to a reasonable accuracy, one of the variances for
the other group (usually the more heterogeneous group). That is to
say, in the typical case we have values for $r_{xy}$, $s_y$, $s_x$, and one
standard deviation for the other group, either $S_X$ or $S_Y$. Unless we
know these four values, the problem cannot be solved.

Let us use the subscript x to designate the variable subjected to
explicit selection and y to designate the variable subjected to incidental
selection, as indicated in Figure 1. First we shall consider the case in
which both variances are known for y, the variable subject to incidental
selection. Here we have given $r_{xy}$, $s_x$, $s_y$, and $S_Y$. The problem
is to express $S_X$ and $R_{XY}$ in terms of these four given values.

The solution for $R_{XY}$ can be obtained directly from equation 6.
Squaring both sides and dividing by $S_Y^2$ gives

(7)   $\dfrac{s_y^2}{S_Y^2}\,(1 - r_{xy}^2) = 1 - R_{XY}^2,$

or

      $\dfrac{1 - R_{XY}^2}{1 - r_{xy}^2} = \dfrac{s_y^2}{S_Y^2}.$

Solving equation 7 explicitly for $R_{XY}$ gives the final result

(8)   $R_{XY} = \sqrt{1 - \dfrac{s_y^2}{S_Y^2}\,(1 - r_{xy}^2)},$

where $r_{xy}$ is the correlation between x and y for one group,
      $s_y$ is the standard deviation of the variable subjected to
         incidental selection for the same group, and
      $S_Y$ is the standard deviation of the same variable for the other
         group (the group for which an estimate of the correlation
         $R_{XY}$ is desired).

      If the variance of a variable subject to incidental selection is
      known for two groups ($s_y^2$ and $S_Y^2$ are known), and the
      correlation between the incidental and explicit selection variables
      is known for one group ($r_{xy}$ is known), equation 8 should be
      used to estimate the correlation between these two variables for
      the second group.

1 This assumption follows from the usual assumption of homoscedasticity. This is
the assumption that the dispersion of y for a given x is the same regardless of the
value of x, and is basic to many of the theorems of statistics.
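As a computational aid, equation 8 can be written as a short function. This
is only a sketch: the function name is hypothetical, the positive root is
taken as in equation 8, and the check for a negative radicand is an added
safeguard for inputs that are inconsistent with the assumptions.

```python
import math


def correlation_from_incidental_sd(r_xy, s_y, S_Y):
    """Equation 8: estimate R_XY from r_xy and the two standard deviations
    of the incidental selection variable (s_y known group, S_Y other group)."""
    radicand = 1.0 - (s_y ** 2 / S_Y ** 2) * (1.0 - r_xy ** 2)
    if radicand < 0.0:
        # Can happen when S_Y is much smaller than s_y and r_xy is low.
        raise ValueError("inputs imply a negative squared correlation")
    return math.sqrt(radicand)


# Ratio S_Y / s_y of 1.2 with r_xy = .45
estimate = correlation_from_incidental_sd(0.45, 1.0, 1.2)
```

When the two standard deviations are equal the function simply returns the
original correlation, which is a quick check that the inputs were entered
in the right order.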

Equation 8 or slight variants of it has been presented by Kelley 
(1923c), Garrett (1947), Guilford (1942), Crawford and Burnham 
(1946), Thorndike (1947), and others. 

It should be noted that nowhere in the previous derivations has it
been assumed that $S_Y$ was greater than $s_y$. In a broader sense the
lower-case subscripts (x and y), which were originally assigned to the
“curtailed group” as shown in Figure 1, may be taken to designate the
group for which complete information is available. Then the upper-case
subscripts (X and Y) designate the group for which only one standard
deviation is available and for which additional information is sought.
Usually we shall have complete information on the group with the
smaller variance and shall wish to estimate the correlation for the
unrestricted group, the one with the larger variance. The equations,
however, are equally applicable if we have complete information for
the unrestricted group and wish to estimate the correlation for various
sorts of restriction. For example, we may know the validity of a given
test in an unrestricted group and may wish to estimate the validity of
that same test for use in a second university that has higher entrance
standards, and hence gets a group of students with a larger mean and
smaller standard deviation.

A computing diagram for equation 8 is shown in Figure 2. This diagram
is set up in terms of equation 7. It shows the value of the ratio
of the standard deviations ($S_Y/s_y$) and the values of the two
correlation coefficients. In order to use this diagram, we find the
diagonal line corresponding to the standard deviation of Y divided by the
standard deviation of y; then we locate the correlation ($r_{xy}$) at the
bottom of the diagram, follow up to the appropriate diagonal line, and then
to the right to read the value of $R_{XY}$. For example, if the ratio
$S_Y/s_y$ is 1.2, and $r_{xy}$ is .45, the expected value of $R_{XY}$ is
about .67 (equation 8 gives .668). It is also possible to use this chart to
find the ratio of standard deviations needed to account for a given
difference in the correlation between x (X) and y (Y). Locate the value of
the smaller correlation on the abscissa, and the larger correlation on the
ordinate, and note the diagonal line that corresponds to the intersection
of these two lines.

Figure 2. Showing the change in correlation as a function of change in
variance of the incidental selection variable.

In constructing Figure 2, we measure the ordinate and the abscissa
in terms of $1 - r_{xy}^2$ and $1 - R_{XY}^2$ (from the upper right-hand
corner). However, the ordinate (on the right) and the base line are labeled
in terms of $r_{xy}$ and $R_{XY}$. The diagonal lines have been drawn to
correspond to the variance ratio but have been labeled in terms of the
ratio of standard deviations.
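The second use of the chart, finding the ratio of standard deviations that
would account for a given pair of correlations, follows from equation 7 and
may be sketched as below. The function name is hypothetical and the example
values are chosen only for illustration.

```python
import math


def sd_ratio_for_correlations(r_small, r_large):
    """Equation 7 rearranged: the ratio S_Y / s_y that would turn the smaller
    correlation into the larger one under the assumptions of this chapter."""
    return math.sqrt((1.0 - r_small ** 2) / (1.0 - r_large ** 2))


# Correlation .45 in the curtailed group, and the value that equation 8
# gives for a standard deviation ratio of 1.2 in the extended group.
larger = math.sqrt(1.0 - (1.0 - 0.45 ** 2) / 1.2 ** 2)
ratio = sd_ratio_for_correlations(0.45, larger)
```

The result recovers the ratio of 1.2 from which the larger correlation was
computed, so the function is the inverse of equation 8 with respect to the
ratio of standard deviations.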

In addition to estimating the correlation for the XY group, it is
sometimes also desirable to estimate the standard deviation of the
variable subjected to explicit selection. In our present notation, this is
the standard deviation of X. It can be estimated readily by solving
equation 3 for $S_X$, obtaining

(9)   $S_X = \dfrac{R_{XY}\,S_Y\,s_x}{r_{xy}\,s_y}.$

If we square the foregoing equation and substitute the value of $R_{XY}$
from equation 8, we have

(10)  $S_X^2 = \left[1 - (1 - r_{xy}^2)\,\dfrac{s_y^2}{S_Y^2}\right] \dfrac{S_Y^2\,s_x^2}{r_{xy}^2\,s_y^2},$

which simplifies to

(11)  $S_X^2 = s_x^2 \left[1 - \dfrac{1}{r_{xy}^2} + \dfrac{S_Y^2}{r_{xy}^2\,s_y^2}\right].$

The terms in the brackets may be combined and the square root
taken, giving

(12)  $S_X = \dfrac{s_x\sqrt{S_Y^2 - s_y^2 + r_{xy}^2 s_y^2}}{r_{xy}\,s_y},$

where $s_x$ is the standard deviation of the variable subjected to explicit
         selection in the group for which complete information is
         available,
      $S_X$ is the estimate of the standard deviation of the same variable
         in the group for which only $S_Y$ is available, and the other
         variables have the same definitions as for equation 8.

      If complete information ($s_x$, $s_y$, and $r_{xy}$) is available for
      one group, and only the variance of Y (the variable subject to
      incidental selection) is known for a second group, equation 12
      should be used to estimate the variance of X (the explicit
      selection variable) in the second group.
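Equation 12 may be sketched in the same way. The function name and the
numerical values are hypothetical; the values were chosen so that the result
can be compared with equation 9 after $R_{XY}$ has been obtained from
equation 8.

```python
import math


def sd_of_explicit_variable(r_xy, s_x, s_y, S_Y):
    """Equation 12: estimate S_X, the standard deviation of the explicit
    selection variable in the group for which only S_Y is known."""
    radicand = S_Y ** 2 - s_y ** 2 + r_xy ** 2 * s_y ** 2
    if radicand < 0.0:
        raise ValueError("inputs are inconsistent with equation 12")
    return s_x * math.sqrt(radicand) / (r_xy * s_y)


# Known group: r_xy = .50, s_x = 1.0, s_y = 1.0; other group: S_Y = 1.2
S_X = sd_of_explicit_variable(0.50, 1.0, 1.0, 1.2)

# Cross-check through equations 8 and 9.
R_XY = math.sqrt(1.0 - (1.0 ** 2 / 1.2 ** 2) * (1.0 - 0.50 ** 2))
S_X_via_equation_9 = R_XY * 1.2 * 1.0 / (0.50 * 1.0)
```

The two routes should give the same value, since equation 12 is only
equation 9 with $R_{XY}$ eliminated.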


5. Variance of both groups known for variable subject to 
explicit selection 

Let us turn now to the situation where we know the variance for both
groups for the explicit selection variable. This case is much more
common than the one previously discussed. For example, if we give a
selection test to a group of applicants, use this test score to select the
upper K per cent of applicants, and then admit this upper K per cent to
college or to an industry so that performance records can be secured for
the upper K per cent, we should then have the variance of the selection
test for both the applicant and the selected group. That is to say, the
variance of both groups would be known for the explicit selection
variable. This case occurs frequently so that the equations developed in
this section will be the ones that are most generally useful in estimating
the effects of selection on validity coefficients.

As before, we shall use x or X to represent the explicit selection
variable, and y or Y to represent the incidental selection variable, in
accordance with the symbols of Figure 1. In the previous notation we
have values for $r_{xy}$, $s_x$, $s_y$, and $S_X$. The problem is to solve
for $R_{XY}$ and $S_Y$ in terms of these four known values.

As before, it must be remembered that we are not assuming that $s_x$ is
smaller than $S_X$. The equations developed will apply when $s_x$ is
smaller than $S_X$ and will also apply when $s_x$ is larger than $S_X$. In
the notation used here, the lower-case subscripts designate the group for
which we have complete information (two variances and the correlation),
whereas the upper-case subscript designates the group for which we know
only the variance of the explicit selection variable.

In order to obtain the equation for $R_{XY}$, we may first solve equation 3
for $S_Y$, obtaining

(13)  $S_Y = \dfrac{r_{xy}\,s_y\,S_X}{R_{XY}\,s_x},$

and then substitute this value for $S_Y$ in equation 6, obtaining

(14)  $s_y\sqrt{1 - r_{xy}^2} = \dfrac{r_{xy}\,s_y\,S_X}{R_{XY}\,s_x}\sqrt{1 - R_{XY}^2}.$

Dividing both sides by $s_y$, squaring both sides, and segregating $R_{XY}$
on one side of the equation gives

(15)  $\dfrac{1 - R_{XY}^2}{R_{XY}^2} = \dfrac{s_x^2\,(1 - r_{xy}^2)}{S_X^2\,r_{xy}^2}.$

The simplest way to graph this function is to divide both numerator
and denominator on the right side by $r_{xy}^2$, obtaining

(16)  $\dfrac{1}{R_{XY}^2} - 1 = \dfrac{s_x^2}{S_X^2}\left(\dfrac{1}{r_{xy}^2} - 1\right).$

Thus we may express the two standard deviations as a ratio and graph
the function as indicated in Figure 3.

Figure 3. Showing the change in correlation as a function of the change in
variance of the explicit selection variable.

Solving explicitly for $R_{XY}$ gives

(17)  $R_{XY}^2 = \dfrac{1}{1 - \dfrac{s_x^2}{S_X^2} + \dfrac{s_x^2}{S_X^2\,r_{xy}^2}}.$

Putting all the denominator over $S_X^2 r_{xy}^2$, inverting, and taking
the square root gives

(18)  $R_{XY} = \dfrac{S_X\,r_{xy}}{\sqrt{S_X^2 r_{xy}^2 + s_x^2 - s_x^2 r_{xy}^2}},$

where the terms have the same definitions as for equations 8 and 12.

      If complete information ($s_x$, $s_y$, and $r_{xy}$) is available for
      one group, and only the variance of X (the variable subject to
      explicit selection) is known for a second group, equation 18
      should be used to estimate the correlation between X and Y
      for the second group ($R_{XY}$).
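Equation 18 is the form most often needed in practice, and a minimal sketch
of it follows. The function name and the example values (a selected group
whose standard deviation on the test is half that of the applicant group)
are hypothetical.

```python
import math


def correlation_from_explicit_sd(r_xy, s_x, S_X):
    """Equation 18: estimate R_XY from r_xy and the two standard deviations
    of the explicit selection variable (s_x known group, S_X other group)."""
    denominator = math.sqrt(
        S_X ** 2 * r_xy ** 2 + s_x ** 2 - s_x ** 2 * r_xy ** 2
    )
    return S_X * r_xy / denominator


# Selected group: r_xy = .50, s_x = 1.0; applicant group: S_X = 2.0
R_XY = correlation_from_explicit_sd(0.50, 1.0, 2.0)
```

Because nothing in the derivation requires $S_X$ to exceed $s_x$, the same
function can be applied in the other direction, from an unrestricted to a
restricted group.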

Equation 18, or slight variants of it, was first derived by Pearson 
(1903a). It has also been presented by Kelley (1923c), Holzinger (1928), 
Thurstone (1931a), Thorndike (1947), Crawford and Burnham (1932), 
and others. 

Sometimes it is also desirable to estimate the standard deviation for
the second group, of the variable subject to incidental selection (the
value of $S_Y$). It may be noted that the value of $S_Y$ is given by
equation 13, except for the fact that this equation contains the term
$R_{XY}$, which is not known. However, the value $R_{XY}$ is given by
equation 18. Substituting equation 18 in equation 13 gives

(19)  $S_Y = \dfrac{r_{xy}\,s_y\,S_X}{s_x\,\dfrac{S_X\,r_{xy}}{\sqrt{S_X^2 r_{xy}^2 + s_x^2 - s_x^2 r_{xy}^2}}},$

which simplifies to

(20)  $S_Y = s_y\sqrt{1 - r_{xy}^2 + r_{xy}^2\,(S_X^2/s_x^2)},$

where the terms have the same definitions as for equations 8 and 12.

      If complete information ($s_x$, $s_y$, and $r_{xy}$) is available for
      one group, and only the variance of X (the variable subject to
      explicit selection) is known for a second group, equation 20
      should be used to estimate the variance of Y (the incidental
      selection variable) for the second group.

6. Comparison of variance change for explicit and incidental 
selection 


In order to compare the change in variability of the variable on which
there is explicit selection with the change in variability of the variable
on which there is incidental selection, we can rewrite equation 20 as
follows:

(21)  $\dfrac{S_Y^2}{s_y^2} - 1 = r_{xy}^2\left(\dfrac{S_X^2}{s_x^2} - 1\right).$

      The percentage of change in variance of the variable subject to
      incidental selection is equal to $r_{xy}^2$ times the percentage
      change in the variance of the variable subjected to explicit
      selection.





Chap. 11] Group Heterogeneity and Validity 

It should be noted that, if both $S_Y$ and $S_X$ are available, it is
possible to check by means of equation 21 to see if the proper relationship
holds. If this relationship does not hold, the selection probably was not
entirely and consistently made on variable x.

Figure 4. Computing diagram showing the relationship between the change in
variance ratios for the explicit selection variable (designated by X and x)
and the incidental selection variable (designated by Y and y).

Figure 4 gives the relationships of equation 21. From this graph it
is possible to determine the variance ratios of the explicit and incidental
selection variables that correspond to any given correlation $r_{xy}$. The
diagram indicates, for example, that, if the variance ratio for the explicit
selection variable is 1.6 and the correlation $r_{xy}$ is .90, the variance
ratio for the incidental selection variable is 1.49.

It should be noted that there is no need to rework the foregoing
derivations for the case where variable y was subjected to explicit
selection. Simply let x (X) in the foregoing equations stand for the
variable on which explicit selection occurred. It may be either the
criterion or the test. Then note whether the variance of the complete
group is known for the variable on which explicit selection occurred
($S_X^2$), or for the variable on which incidental selection occurred
($S_Y^2$), and use the equations appropriate to the information available.
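The relationship read from Figure 4, and the consistency check suggested in
this section, may be sketched as follows. The function names are
hypothetical, and the tolerance of 5 per cent used in the check is an
arbitrary assumption, not a value given in the text.

```python
def incidental_variance_ratio(r_xy, explicit_variance_ratio):
    """Equation 21 solved for S_Y^2 / s_y^2, given S_X^2 / s_x^2."""
    return 1.0 + r_xy ** 2 * (explicit_variance_ratio - 1.0)


def selection_looks_univariate(r_xy, s_x, s_y, S_X, S_Y, tolerance=0.05):
    """Compare the observed variance ratio for y with the ratio that
    equation 21 predicts from the variance ratio for x."""
    observed = S_Y ** 2 / s_y ** 2
    expected = incidental_variance_ratio(r_xy, S_X ** 2 / s_x ** 2)
    return abs(observed - expected) <= tolerance * expected


# The example read from Figure 4: explicit ratio 1.6, r_xy = .90
ratio_y = incidental_variance_ratio(0.90, 1.6)
```

For the example in the text the function gives 1.486, which agrees with the
value of 1.49 read from the diagram.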


7. Relationship between reliability and validity for incidental 
selection 


In Chapter 9 we saw that the ratio of the validity coefficient to the
index of reliability does not change with increases (or decreases) in test
length (see equation 13 of Chapter 9). Similarly, when considering the
effect of changes in group heterogeneity, it is possible to find a
relationship between validity and reliability that does not change as the
heterogeneity of the group changes.

First let us consider this effect for the variable subject to incidental
selection. As before, we shall use y to designate this variable and x to
designate the variable subject to explicit selection. As noted in
equation 6, when y is subject to incidental and x to explicit selection,
the quantity $s_y\sqrt{1 - r_{xy}^2}$ is invariant with respect to explicit
selection on variable x. As indicated in Chapter 10, equation 3, the
quantity $s_y\sqrt{1 - r_{yy}}$ (the error of measurement) for y does not
change as the heterogeneity of the group changes.

Since each of these quantities is invariant with respect to changes in
group heterogeneity, their ratio is also invariant. Dividing one
quantity by the other, canceling the term $s_y$, and squaring the remaining
fraction, we have

(22)  $C = \dfrac{1 - r_{yy}}{1 - r_{xy}^2},$

where C is arbitrarily used to designate this constant,
      $r_{yy}$ is the reliability of the test y, and
      $r_{xy}$ is the correlation between y and the explicit selection
         variable (x).

      If x is subject to explicit and y to incidental selection, then
      for any type of explicit selection on x the ratio
      $\dfrac{1 - r_{yy}}{1 - r_{xy}^2}$ is constant.


8. Relationship between reliability and validity for explicit
selection

In order to obtain a relationship between $r_{xx}$ (the reliability of x)
and $r_{xy}$ (the correlation between x and y) that does not change with
explicit selection on x, we make use of the following assumptions:

1. The error of measurement for x is invariant with respect to explicit
selection on x (equation 3, Chapter 10),

      $s_x\sqrt{1 - r_{xx}} = c_1.$

2. The error made in estimating y from x does not change (equation 6),

      $s_y\sqrt{1 - r_{xy}^2} = c_2.$

3. The slope of the regression of y on x does not change (equation 3),

      $r_{xy}\,\dfrac{s_y}{s_x} = c_3.$

These three equations can readily be combined so as to eliminate the
standard deviations of x and of y. The first equation multiplied by the
third and divided by the second gives an expression in which the
standard deviations cancel out, leaving a constant that we may designate
as C'.

(23)  $C' = \dfrac{r_{xy}\sqrt{1 - r_{xx}}}{\sqrt{1 - r_{xy}^2}},$

where $r_{xx}$ is the reliability of the variable subject to explicit
         selection, and
      $r_{xy}$ is the correlation of this variable with any other variable.

      If x is subject to explicit and y to incidental selection, then for
      any type of explicit selection on x the quantity
      $\dfrac{r_{xy}\sqrt{1 - r_{xx}}}{\sqrt{1 - r_{xy}^2}}$ is constant.
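The two invariants, equations 22 and 23, can be used to carry a reliability
from one group to another once the correlation for the second group has been
estimated. The sketch below solves each equation for the reliability in the
second group; the function names and the numerical values are hypothetical.

```python
def incidental_reliability_after_selection(r_yy, r_xy, R_XY):
    """Equation 22: (1 - r_yy) / (1 - r_xy^2) is constant, so the reliability
    of y in the second group follows from R_XY."""
    c = (1.0 - r_yy) / (1.0 - r_xy ** 2)
    return 1.0 - c * (1.0 - R_XY ** 2)


def explicit_reliability_after_selection(r_xx, r_xy, R_XY):
    """Equation 23 squared: r_xy^2 (1 - r_xx) / (1 - r_xy^2) is constant, so
    the reliability of x in the second group follows from R_XY."""
    c_prime_squared = r_xy ** 2 * (1.0 - r_xx) / (1.0 - r_xy ** 2)
    return 1.0 - c_prime_squared * (1.0 - R_XY ** 2) / R_XY ** 2


# Known group: r_xy = .50, r_xx = .80, r_yy = .90, s_x = s_y = 1.0.
# Second group: S_X = 2.0, so equation 18 gives R_XY^2 = 1 / 1.75.
R_XY = (1.0 / 1.75) ** 0.5
R_YY = incidental_reliability_after_selection(0.90, 0.50, R_XY)
R_XX = explicit_reliability_after_selection(0.80, 0.50, R_XY)
```

With these values the results agree with what the invariance of the error of
measurement gives directly: $S_X^2(1 - R_{XX}) = s_x^2(1 - r_{xx})$ leads to
.95 for x, and the same argument with $S_Y^2 = 1.75$ (equation 20) leads to
about .943 for y.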


9. Summary

The variable directly used for selection has been termed the explicit 
selection variable and designated by x or X. The correlated variable, 
which is affected only indirectly because of its relationship with the 
explicit selection variable, has been termed the incidental selection 
variable and designated by y or Y. 

The basic assumptions for the bivariate case are first that the slope
of the regression of y on x is equal to the slope of the regression of Y
on X, which is given in equation

(3)   $r_{xy}\,\dfrac{s_y}{s_x} = R_{XY}\,\dfrac{S_Y}{S_X},$

and second that the error made in estimating y from x is the same as
that made in estimating Y from X, which is given in equation

(6)   $s_y\sqrt{1 - r_{xy}^2} = S_Y\sqrt{1 - R_{XY}^2}.$

There is no need to specify that one group is the extended and the
other the curtailed group. The same equations apply either for
estimation from a restricted to an extended range or from an extended to a
restricted range. The convention was adopted that the lower-case
letters should stand for the group on which complete information was
available. In other words, it was assumed that $r_{xy}$, $s_x$, and $s_y$
were known. We then have two types of cases.

First we considered the case in which the variance of both groups is
known for the incidental selection variable ($S_Y$ is known). For this
case, we have

(8)   $R_{XY} = \sqrt{1 - \dfrac{s_y^2}{S_Y^2}\,(1 - r_{xy}^2)}$

and

(12)  $S_X = \dfrac{s_x\sqrt{S_Y^2 - s_y^2 + r_{xy}^2 s_y^2}}{r_{xy}\,s_y}.$

Second we considered the case in which the variance of both groups
is known for the explicit selection variable ($S_X$ is known). For this
case we have

(18)  $R_{XY} = \dfrac{S_X\,r_{xy}}{\sqrt{S_X^2 r_{xy}^2 + s_x^2 - s_x^2 r_{xy}^2}}$

and

(20)  $S_Y = s_y\sqrt{1 - r_{xy}^2 + r_{xy}^2\,(S_X^2/s_x^2)}.$

Two computing diagrams were presented for these selection equations,
Figure 2 for equation 8 and Figure 3 for equation 18.

In order to demonstrate that the effect on the standard deviation of
the explicit selection variable (x or X) was greater than the effect on
the standard deviation of the incidental selection variable, this
relationship was shown in

(21)  $\dfrac{S_Y^2}{s_y^2} - 1 = r_{xy}^2\left(\dfrac{S_X^2}{s_x^2} - 1\right).$

Figure 4 was presented to illustrate this equation.

Just as there is a relationship between validity and reliability of a
test that is maintained as the length of the test is varied (see equation 13,
Chapter 9), there is a different relationship between reliability and
validity as the heterogeneity of the group is varied. The relationship
between the reliability of the incidental selection variable and its
validity for predicting the explicit selection variable is given by

(22)  $\dfrac{1 - r_{yy}}{1 - r_{xy}^2}$ equals a constant.

The relationship between the reliability of the explicit selection variable
and its validity for predicting the incidental selection variable is given by

(23)  $\dfrac{r_{xy}\sqrt{1 - r_{xx}}}{\sqrt{1 - r_{xy}^2}}$ equals a constant.


Problems 

Assume that a population is selected on the basis of test scores in such a way as to 
change the variance of test scores. 

1. Write the equation for estimating reliability in the new population as a function 
of test variance in the new population, and the test variance and reliability for the 
old population. 

2. Write the corresponding equation showing the relationship between the test
variances and validities for both populations. (The criterion is subject only to
incidental selection.)

3. Write the equation showing the relationship between test reliability and test
validity for a given test, as the population is subject to selection on the basis of test
score. (Suggestion: Eliminate variances in the two preceding equations.)

4. Compare this relationship with the expected relationship between reliability 
and validity when test length is varied for a fixed population. 


Data for Problems 5-10

                                                           Correlation of
                        Standard    Number                 Each Test with
Test          Mean      Deviation   of Items  Reliability  the Same Criterion

A             19.3        4.1          30        .74            .69
B             10.2        4.2          20        .87            .70
C             58.3       13.8         100        .92            .77
D             27.8        9.7          50        .95            .74
E             68.1       24.8         130        .97            .75
Criterion    117.8       20.1         200        .90

5. Assume that selection has occurred on the basis of test A.

(a) Estimate the validity that test A would have in an unselected group for which
the standard deviation of test A was 6.7.

(b) Estimate the standard deviation of the criterion for this unselected group.

6. Assume that only persons with criterion scores over 90.0 were included in this
sample, and that otherwise there has been no selection of cases. It is also known that
for an unselected group the standard deviation of test A is 6.7.

(a) Estimate the validity of test A for the unselected group.

(b) Estimate the standard deviation of the criterion for the unselected group.

7. A group of high-scoring persons on test B are selected, and it is found that the
standard deviation of test B for this limited group is 3.2.

(a) What validity would test B probably have for this restricted group?

(b) What would be the variance of the criterion for this restricted group?

8. Compare tests C and D on the assumptions that the data are from the same
subjects, that no selection occurred on test C, whereas subjects were screened on the
basis of test D scores from an unselected group with a standard deviation of 12.0
on test D.

9. What is the standard deviation of the difference between actual criterion
scores and criterion scores predicted from scores on test E?

10. If we screen a group using test E scores and obtain a selected subgroup with
standard deviation 15.0 on test E, what will (a) the validity and (b) the error of
estimate be for test E in this subgroup?

Correction for Univariate Selection
in the Three-Variable Case


1. Introduction

Let us consider the three-variable case. If there is explicit selection
on test X, what effect will this have upon the correlation between two
variables (Y and Z) subject to incidental selection because they each
correlate with X? It may be noted that insofar as we are interested in
the correlation between X and Y, or between X and Z, the equations
for the bivariate case given in Chapter 11 will apply directly. Also,
if we are given the variance of X and wish to obtain that of Y or Z and,
conversely, if we are given the variance of Y or Z and wish to obtain
that of X, the bivariate equations given in Chapter 11 may also be
used. There is only one problem regarding variances that is unique to
the three-variable situation: given the variance of Y, solve for that of Z,
or vice versa. That is, we have the problem of how to express the
variance of one incidental selection variable in terms of the other
incidental selection variable. The other problem that occurs in the
three-variable but not in the two-variable case is determining the
correlation between two incidental selection variables (the correlation
$R_{YZ}$).

2. Practical importance of the three-variable case in
univariate selection

Let us consider a practical illustration of three variables with
univariate selection. Suppose we are trying out a new test for the
selection of college students. Let us call this new test, test Y. The
students available for a validation study have already been admitted to
college on the basis of selection on test X. Test Y is then administered to
the freshman class, and the new test is correlated with college grades.
College grades is the criterion score which for present purposes we may
designate as variable Z. Since tests X and Y do not correlate perfectly,
it is evident that all the freshman class will have passed test X (since
it was used for admission), but some of the freshman class will fail
test Y. That is, the range of scores on Y will be greater than the range
of scores on test X, since X is subject to explicit selection and Y only
to incidental selection. Equation 21 of Chapter 11 and Figure 4 of
that chapter show that the variance of the explicit selection variable is
reduced more than the variance of the incidental selection variable.
We are interested in comparing the validity of test X and test Y under
similar sampling procedures. If an unselected group had been admitted
to college, would the validity of test X have been higher or lower than
the validity of test Y?

To illustrate the importance of this problem, let us consider a 
hypothetical case in which two admissions tests (X and Y) have the same 
validity, that is, the same correlation with grades (Z), for an 
unrestricted group. Then for the restricted 
group, the validity of the test actually used for selection (the explicit 
selection variable) would always be lower than the validity of the test 
not used for selection (the incidental selection variable). In other words, 
if we considered only the zero-order uncorrected correlation coefficients, 
we should always reach the conclusion that the test not being used for 
selection was better than the one being used. When X was used 
for selection, Y would have the higher validity (uncorrected); and, when 
Y was used for selection, X would have the higher validity. 
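
The effect can be checked numerically. The following Python sketch is an illustration only, not part of the original derivation. It starts from hypothetical correlations for an unrestricted group in which X and Y have equal validity, and applies equations 12 and 19 of this chapter in the direction from the unrestricted to the restricted group, which the symmetry of the assumptions permits. The function name and all numerical values are assumptions chosen for the example.

```python
import math

def restricted_validities(R_xy, R_xz, R_yz, u):
    """Correlations with the criterion Z after explicit selection on X.

    u is the ratio (restricted SD of X) / (unrestricted SD of X).
    Returns (r_xz, r_yz) for the restricted group.
    """
    k = u * u - 1.0                      # change in the variance ratio of X
    y_factor = 1.0 + R_xy ** 2 * k       # variance ratio of Y (cf. equation 9)
    z_factor = 1.0 + R_xz ** 2 * k       # variance ratio of Z (cf. equation 11)
    r_xz = R_xz * u / math.sqrt(z_factor)                              # cf. equation 12
    r_yz = (R_yz + R_xy * R_xz * k) / math.sqrt(y_factor * z_factor)   # cf. equation 19
    return r_xz, r_yz

# Hypothetical values: both tests correlate .50 with grades before selection,
# and selection on X halves the standard deviation of X.
r_xz, r_yz = restricted_validities(R_xy=0.60, R_xz=0.50, R_yz=0.50, u=0.5)
print(f"validity of X (used for selection): {r_xz:.3f}")
print(f"validity of Y (not used):           {r_yz:.3f}")
```

With these assumed values the uncorrected validity of the test used for selection comes out lower than that of the test not used, although the two were equal before selection.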

It follows that, whenever a test is already being used for selection, 
it is relatively easy to try out another test on the selected group and find 
that it is better than the test already in use. Evidence that a selection 
program already in use is not as good as a new one proposed is not 
convincing if we use only the uncorrected validity coefficients. It is 
necessary to use validity coefficients that have been adjusted for the effects 
of selection. Some of the appropriate equations were presented in 
Chapter 11; the others will be given here. 

3. Basic assumptions for univariate selection in the three-variable case 

Let us designate these three variables as X, Y, and Z and find out 
how selection on the basis of variable X will affect the correlation 
between Y and Z and the variances of Y and Z. Since selection was on the 
basis of test X, the regression of Y on X and also the regression of Z on X 
will not be altered. In order to state this, we shall designate one group 
by capital letters, the other group by lower-case letters, and write 

(1)  $r_{xy}\,\frac{s_y}{s_x} = R_{XY}\,\frac{S_Y}{S_X}$, 

(2)  $r_{xz}\,\frac{s_z}{s_x} = R_{XZ}\,\frac{S_Z}{S_X}$. 

Also, as before, it is reasonable to assume that the variance of Y about 
the regression of Y on X will be about the same for both groups. 
Similarly, the variance of Z about the regression of Z on X will be about the 
same for both groups. Again using capital letters for one group and 
small letters for the other, we can write 

(3)  $s_y^2(1 - r_{xy}^2) = S_Y^2(1 - R_{XY}^2)$, 

(4)  $s_z^2(1 - r_{xz}^2) = S_Z^2(1 - R_{XZ}^2)$. 

In addition to the foregoing assumptions, which are identical with 
those that were used in the bivariate case, it is also necessary to make 
one other assumption for the three-variable case. This assumption is 
that the correlation between Y and Z for a constant X is the same for 
both groups. It can be seen that holding X constant by the statistical 
device of partial correlation should give about the same results, 
regardless of whether or not there is selection on X. Holding X constant is 
the most extreme form of selection possible so that the resulting partial 
correlation between Y and Z will be about the same both for the entire 
group and the curtailed group. Using the conventional notation, we 
may write 

(5)  $r_{yz \cdot x} = \frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}} = \frac{R_{YZ} - R_{XY} R_{XZ}}{\sqrt{(1 - R_{XY}^2)(1 - R_{XZ}^2)}} = R_{YZ \cdot X}$. 

All the assumptions necessary for the three-variable problem in 
univariate selection are given in equations 1 to 5. It should be noted 
that the equations are perfectly symmetrical so that any solution 
obtained for estimating a correlation in a group with a larger variance 
will apply equally well for estimating the correlation in a group with a 
smaller variance. Therefore, instead of saying that the lower-case 
letters stand for the restricted group and the upper-case letters for the 
unrestricted group, we shall adopt the convention that the lower-case 
letters designate values for the group on which complete information is 
available. For one of the groups involved it is necessary to have the 
complete information, consisting of three correlations ($r_{xy}$, $r_{xz}$, 
and $r_{yz}$) and three standard deviations ($s_x$, $s_y$, and $s_z$). We shall 
use the lower-case letters to designate this group, regardless of whether 
it is the restricted or the unrestricted group. 

The upper-case letters will be used to designate the group for which 
only one standard deviation is known. The equations developed will 
hold regardless of whether this group is the restricted or the unrestricted 
group. We shall consider first the case where only the standard 
deviation of the explicit selection variable is known for the second group 
(only $S_X$ is known). We shall consider second the case where only the 
standard deviation of one of the incidental selection variables is known 
for the second group (only $S_Y$ is known). 


4. Variance of both groups known for the explicit selection 
variable 

Let us proceed as before to solve equation 1 for $R_{XY}$, obtaining 

(6)  $R_{XY} = r_{xy}\,\frac{s_y S_X}{S_Y s_x}$. 

Next this value for $R_{XY}$ is substituted in equation 3, giving 

(7)  $s_y^2(1 - r_{xy}^2) = S_Y^2\left(1 - r_{xy}^2\,\frac{s_y^2 S_X^2}{S_Y^2 s_x^2}\right)$. 

Multiplying the right side through by $S_Y^2$ in order to remove 
parentheses gives 

(8)  $s_y^2(1 - r_{xy}^2) = S_Y^2 - r_{xy}^2 s_y^2\,\frac{S_X^2}{s_x^2}$. 

This equation can readily be solved for $S_Y^2$: 

(9)  $S_Y^2 = s_y^2\left[1 - r_{xy}^2 + r_{xy}^2\,\frac{S_X^2}{s_x^2}\right]$. 

This value for $S_Y$ is then substituted in equation 6, $H_X$ being substituted 
for the ratio $S_X/s_x$, giving the following value for $R_{XY}$: 

(10)  $R_{XY} = \frac{r_{xy} H_X}{\sqrt{1 - r_{xy}^2 + r_{xy}^2 H_X^2}}$. 


It will be noted by way of check that these values of $S_Y$ and $R_{XY}$ are 
the same as the results previously obtained in Chapter 11 for $S_Y$ and 
$R_{XY}$ on the basis of the assumption of selection on X. Similarly, we 
may write by analogy the results for $S_Z$ and $R_{XZ}$ in the present case. 
The solutions already obtained for $S_Y$ and $R_{XY}$ will give $S_Z$ and $R_{XZ}$ 
if Z is substituted for Y as follows:¹ 

(11)  $S_Z^2 = s_z^2\left[1 - r_{xz}^2 + r_{xz}^2 H_X^2\right]$, 

(12)  $R_{XZ} = \frac{r_{xz} H_X}{\sqrt{1 - r_{xz}^2 + r_{xz}^2 H_X^2}}$. 

¹ Students who do not yet feel at home with this device of substituting subscripts 
should solve independently for $S_Z$ and $R_{XZ}$ by following the general plan of equations 
6 to 10. It will be seen that equations 11 and 12 must be the result of this procedure. 

The only remaining task is to solve for $R_{YZ}$. It involves solving 
equation 5 for $R_{YZ}$ and substituting the known values of $R_{XY}$ and $R_{XZ}$. 
A rather simple algebraic routine for doing this is to note from equations 
3 and 4 that we may write 

(13)  $\sqrt{1 - R_{XY}^2} = \frac{s_y}{S_Y}\sqrt{1 - r_{xy}^2}$ 

and 

(14)  $\sqrt{1 - R_{XZ}^2} = \frac{s_z}{S_Z}\sqrt{1 - r_{xz}^2}$. 

These values may then be substituted for the denominator of the 
right-hand term in equation 5, giving 

(15)  $\frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{1 - r_{xy}^2}\,\sqrt{1 - r_{xz}^2}} = \frac{(R_{YZ} - R_{XY} R_{XZ})\,S_Y S_Z}{s_y s_z\,\sqrt{1 - r_{xy}^2}\,\sqrt{1 - r_{xz}^2}}$. 

Solving for $R_{YZ}$, we have 

(16)  $R_{YZ} = \frac{(r_{yz} - r_{xy} r_{xz})\,s_y s_z}{S_Y S_Z} + R_{XY} R_{XZ}$. 

From equations 1 and 2 we can write 

(17)  $R_{XY} R_{XZ} = r_{xy} r_{xz}\,\frac{s_y s_z S_X^2}{s_x^2 S_Y S_Z}$. 

Substituting equation 17 in equation 16 and factoring out $s_y s_z / S_Y S_Z$, 
we have 

(18)  $R_{YZ} = \frac{s_y s_z}{S_Y S_Z}\left[r_{yz} - r_{xy} r_{xz} + r_{xy} r_{xz}\,\frac{S_X^2}{s_x^2}\right]$, 


which expresses $R_{YZ}$ in terms of given quantities and of the two values 
($S_Y$ and $S_Z$) for which solutions have already been obtained. 
Substituting equations 9 and 11 in equation 18 and simplifying, we have 

(19)  $R_{YZ} = \frac{r_{yz} - r_{xy} r_{xz} + r_{xy} r_{xz}\,\frac{S_X^2}{s_x^2}}{\sqrt{\left[1 - r_{xy}^2 + r_{xy}^2\,\frac{S_X^2}{s_x^2}\right]\left[1 - r_{xz}^2 + r_{xz}^2\,\frac{S_X^2}{s_x^2}\right]}}$, 

where $R_{YZ}$ is the correlation between two incidental selection variables 
for one group, and 
$S_X$ is the standard deviation of the explicit selection variable in 
the same group. 

For the other group involved, complete information is available as 
follows: 

$r_{yz}$ is the correlation between two incidental selection variables, 
$r_{xy}$ and $r_{xz}$ are the correlations of the explicit selection variable with 
each of the incidental selection variables, 
$s_x$ is the standard deviation of the explicit selection variable, 
and 
$s_y$ and $s_z$ are the standard deviations of the two incidental selection 
variables. 

If there is explicit selection on variable x, the variance of x 
and three correlations ($r_{xy}$, $r_{xz}$, and $r_{yz}$) are available for one 
group, and only the variance of X ($S_X^2$) is available for a 
second group, then the correlation of the two incidental 
selection variables ($R_{YZ}$) for the second group may be estimated 
by equation 19. 

It should be noted that where the variance of the explicit selection 
variable is known for both groups, it is necessary to present only the 
equation for the correlation between the two incidental selection 
variables. All other values (the variance of the two incidental selection 
variables and their correlation with the explicit selection variable) are 
given by the bivariate equations 18 and 20, presented in Chapter 11. 

Equation 19, or slight variants of it, has been given by Pearson 
(1903a) and Thorndike (1947). 
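
For computation, equations 9 to 12 and 19 can be gathered into one short routine. The Python sketch below is an added illustration under the assumptions of equations 1 to 5; the function name `correct_given_explicit_sd`, its argument names, and the numerical values are hypothetical and do not come from the text.

```python
import math

def correct_given_explicit_sd(r_xy, r_xz, r_yz, s_x, s_y, s_z, S_X):
    """Second-group estimates when only S_X is known for that group.

    Lower-case arguments describe the group with complete information.
    """
    h2 = (S_X / s_x) ** 2                       # H_X squared
    fy = 1.0 - r_xy ** 2 + r_xy ** 2 * h2       # bracket of equation 9
    fz = 1.0 - r_xz ** 2 + r_xz ** 2 * h2       # bracket of equation 11
    return {
        "S_Y": s_y * math.sqrt(fy),             # equation 9
        "S_Z": s_z * math.sqrt(fz),             # equation 11
        "R_XY": r_xy * math.sqrt(h2 / fy),      # equation 10
        "R_XZ": r_xz * math.sqrt(h2 / fz),      # equation 12
        "R_YZ": (r_yz - r_xy * r_xz + r_xy * r_xz * h2)
                / math.sqrt(fy * fz),           # equation 19
    }

# Hypothetical restricted-group values; S_X = 20 is the unrestricted SD of X.
est = correct_given_explicit_sd(r_xy=0.40, r_xz=0.30, r_yz=0.35,
                                s_x=10.0, s_y=8.0, s_z=6.0, S_X=20.0)
for name, value in est.items():
    print(f"{name} = {value:.4f}")
```

A useful check on any such routine is that the partial correlation of Y and Z with X held constant (equation 5) comes out the same whether it is computed from the given or from the estimated values.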

5. Variance of both groups known for one of the incidental 
selection variables 

We now turn to the final case for univariate selection with three 
variables. Again we designate the explicit selection variable as x and 
assume that the standard deviation for both groups is known on one 
of the other variables. 

Since both Y and Z are variables that have been subject only to 
incidental selection, it makes no difference whether $S_Y$ is assumed known 
and $S_Z$ unknown, or vice versa. We shall assume $S_Y$ known, and solve 
for $S_Z$. The equations derived can be applied generally by designating 
as Y the variable subject to incidental selection for which both variances 
are known. The incidental selection variable for which only one 
variance is known will be designated as variable Z. 

Without loss of generality then, we may say that selection is on 
variable X and that the standard deviations of both groups are available 
on variable Y. For this case the known values are $r_{xy}$, $r_{xz}$, $r_{yz}$, 
$s_x$, $s_y$, $s_z$, and $S_Y$. Knowing these values enables us immediately 
to solve for $R_{XY}^2$ by equation 3, giving 

(20)  $R_{XY}^2 = 1 - \frac{s_y^2}{S_Y^2}\,(1 - r_{xy}^2)$, 

which is the same as equation 8 of Chapter 11. Rearranging the terms 
in equation 1, we find that 

(21)  $\frac{S_X}{s_x} = \frac{R_{XY} S_Y}{r_{xy} s_y}$. 


Substituting the value for $R_{XY}$ found in equation 20 and solving for 
$S_X$, we have 

(22)  $S_X = s_x\,\frac{\sqrt{S_Y^2 - s_y^2(1 - r_{xy}^2)}}{s_y r_{xy}}$. 


Equation 22 is essentially the same as equation 12 of Chapter 11. 

Next let us solve for $S_Z$. One way of doing this is to write equation 2 
explicitly for $R_{XZ}$: 

(23)  $R_{XZ} = r_{xz}\,\frac{s_z S_X}{s_x S_Z}$. 

This value for $R_{XZ}$ is then substituted in equation 4, giving 

(24)  $s_z^2(1 - r_{xz}^2) = S_Z^2\left[1 - r_{xz}^2\,\frac{s_z^2 S_X^2}{s_x^2 S_Z^2}\right]$. 

Removing the brackets on the right-hand side of equation 24, we obtain 

(25)  $s_z^2(1 - r_{xz}^2) = S_Z^2 - r_{xz}^2 s_z^2\,\frac{S_X^2}{s_x^2}$. 

Solving this equation for $S_Z^2$, we have 

(26)  $S_Z^2 = s_z^2\left[1 - r_{xz}^2 + r_{xz}^2\,\frac{S_X^2}{s_x^2}\right]$. 


Substituting equation 22 in equation 26 and simplifying gives 

(27)  $S_Z^2 = s_z^2\left[1 - r_{xz}^2 + r_{xz}^2\,\frac{S_Y^2 - s_y^2(1 - r_{xy}^2)}{s_y^2 r_{xy}^2}\right]$. 

Expanding and simplifying, we have 

(28)  $S_Z^2 = s_z^2\,\frac{s_y^2 r_{xy}^2 - s_y^2 r_{xz}^2 + S_Y^2 r_{xz}^2}{s_y^2 r_{xy}^2}$, 

where $S_Z^2$ is the unknown variance of Z, expressed in terms of the 
known quantities, 
$S_Y^2$ is the variance of one of the incidental selection variables 
for the second group, 
$s_y^2$ is the variance of y for the first group, 
$s_z^2$ is the variance of z for the first group, and 
$r_{xy}$ and $r_{xz}$ are the correlations of the explicit selection variable with 
each of the other variables for the first group. 


If selection is explicitly based on variable x, complete 
information is available for one group, and the variance of one of 
the incidental selection variables (y) is known for another 
group, then equation 28 may be used to estimate the 
variance of the other incidental selection variable (Z) for the 
second group. 
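
Equations 22 and 28 translate directly into a few lines of arithmetic. The sketch below is an added illustration, not part of the original text; the function name `sd_from_incidental` and the numerical values are hypothetical. The absolute value of $r_{xy}$ is used in equation 22 so that the estimated standard deviation is positive even for a negative correlation, which is an assumption of this sketch.

```python
import math

def sd_from_incidental(r_xy, r_xz, s_x, s_y, s_z, S_Y):
    """Return (S_X, S_Z) for the second group when only S_Y is known."""
    inner = S_Y ** 2 - s_y ** 2 * (1.0 - r_xy ** 2)
    if inner <= 0 or r_xy == 0:
        # Such values cannot arise if assumptions 1 and 3 hold.
        raise ValueError("values are inconsistent with the selection assumptions")
    S_X = s_x * math.sqrt(inner) / (s_y * abs(r_xy))             # equation 22
    var_Z = s_z ** 2 * (s_y ** 2 * r_xy ** 2 - s_y ** 2 * r_xz ** 2
                        + S_Y ** 2 * r_xz ** 2) / (s_y ** 2 * r_xy ** 2)  # equation 28
    return S_X, math.sqrt(var_Z)

# Hypothetical first-group values; S_Y is the only second-group value known.
S_X, S_Z = sd_from_incidental(r_xy=0.40, r_xz=0.30,
                              s_x=10.0, s_y=8.0, s_z=6.0,
                              S_Y=math.sqrt(94.72))
print(f"S_X = {S_X:.4f}, S_Z = {S_Z:.4f}")
```

The guard clause reflects a practical point: if the known variance of Y in the second group is smaller than the error variance of Y about its regression on X in the first group, the assumptions cannot both hold and no estimate should be reported.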


In order to obtain a value for $R_{XZ}$, let us use equation 4 and solve 
for $R_{XZ}^2$, obtaining 

(29)  $R_{XZ}^2 = 1 - \frac{s_z^2(1 - r_{xz}^2)}{S_Z^2}$. 

Substituting the value of $S_Z^2$ from equation 28, we obtain 

(30)  $R_{XZ}^2 = 1 - \frac{s_y^2 r_{xy}^2(1 - r_{xz}^2)}{S_Y^2 r_{xz}^2 - s_y^2 r_{xz}^2 + s_y^2 r_{xy}^2}$. 

This solution expresses $R_{XZ}$ entirely in terms of known quantities. We 
may put this in another form by multiplying and simplifying, obtaining 

(31)  $R_{XZ} = r_{xz}\sqrt{\frac{S_Y^2 - s_y^2 + s_y^2 r_{xy}^2}{r_{xz}^2(S_Y^2 - s_y^2) + s_y^2 r_{xy}^2}}$. 


This form, it will be noted, could also have been readily obtained by 
substituting equations 22 and 28 in equation 23. 

In equation 31 Rxz is the unknown, the correlation between the 
explicit selection variable and the other incidental selection variable, 
and all other terms have the same definition as in equation 28. 


If selection is explicitly based on variable x, complete 
information is available for one group, and the variance of one 
of the incidental selection variables (y) is known for another 
group, then equation 31 should be used to estimate the 
correlation between the explicit selection variable and the other 
incidental selection variable for the second group. 

Let us now turn to the problem of solving for $R_{YZ}$. First let us note 
that from equations 3 and 4 we may write 

(32)  $\sqrt{(1 - R_{XY}^2)(1 - R_{XZ}^2)} = \frac{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}\;(s_y s_z)}{S_Y S_Z}$. 

Substituting this value in equation 5 gives 

(33)  $\frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}} = \frac{(R_{YZ} - R_{XY} R_{XZ})\,S_Y S_Z}{s_y s_z\,\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}}$. 

Solving this equation for $R_{YZ}$, we have 

(34)  $R_{YZ} = \frac{s_y s_z}{S_Y S_Z}\,(r_{yz} - r_{xy} r_{xz}) + R_{XY} R_{XZ}$. 

From equations 1 and 2 we obtain the following expression for $R_{XY} R_{XZ}$: 

(35)  $R_{XY} R_{XZ} = r_{xy} r_{xz}\,\frac{s_y s_z S_X^2}{s_x^2 S_Y S_Z}$. 


If this value for the product $R_{XY} R_{XZ}$ is substituted in equation 34 and 
the common terms ($s_y s_z / S_Y S_Z$) factored out, we have 

(36)  $R_{YZ} = \frac{s_y s_z}{S_Y S_Z}\left[r_{yz} - r_{xy} r_{xz} + r_{xy} r_{xz}\,\frac{S_X^2}{s_x^2}\right]$. 

This equation expresses $R_{YZ}$ in terms of known quantities, and the values 
$S_X$ and $S_Z$ for which equations have already been given. If we substitute 
equations 22 and 28 in equation 36 and simplify, we get 

(37)  $R_{YZ} = \frac{r_{xz}(S_Y^2 - s_y^2) + r_{xy} r_{yz} s_y^2}{S_Y\sqrt{r_{xz}^2(S_Y^2 - s_y^2) + s_y^2 r_{xy}^2}}$, 

where $r_{yz}$ is the correlation between the two incidental selection variables 
for the first group, 
$R_{YZ}$ is the correlation between the two incidental selection variables 
for the second group, and the other terms have the same 
definitions as for equation 28. 


If selection is explicitly based on variable x, complete 
information is available for one group, and the variance of one of 
the incidental selection variables (y) is known for another 
group, then equation 37 should be used to estimate the 
correlation between the two incidental selection variables for the 
second group. 
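
The three correlations for the second group (equations 20, 31, and 37) can likewise be computed together. The Python sketch below is an added illustration; the function name, argument names, and numerical values are hypothetical. It assumes that $r_{xy}$ is positive, since the simplification leading to equation 37 takes the positive square root; a test scored in the opposite direction would first be reflected.

```python
import math

def correlations_from_incidental(r_xy, r_xz, r_yz, s_y, S_Y):
    """Return (R_XY, R_XZ, R_YZ) for the second group when only S_Y is known."""
    if r_xy <= 0:
        raise ValueError("this sketch assumes a positive r_xy")
    d = S_Y ** 2 - s_y ** 2                       # change in the variance of Y
    R_xy_sq = 1.0 - (s_y ** 2 / S_Y ** 2) * (1.0 - r_xy ** 2)    # equation 20
    denom = r_xz ** 2 * d + s_y ** 2 * r_xy ** 2
    if R_xy_sq < 0 or denom <= 0:
        raise ValueError("values are inconsistent with the selection assumptions")
    R_XY = math.sqrt(R_xy_sq)
    R_XZ = r_xz * math.sqrt((d + s_y ** 2 * r_xy ** 2) / denom)  # equation 31
    R_YZ = (r_xz * d + r_xy * r_yz * s_y ** 2) / (S_Y * math.sqrt(denom))  # equation 37
    return R_XY, R_XZ, R_YZ

# Hypothetical first-group values; S_Y is the only second-group value known.
R_XY, R_XZ, R_YZ = correlations_from_incidental(
    r_xy=0.40, r_xz=0.30, r_yz=0.35, s_y=8.0, S_Y=math.sqrt(94.72))
print(f"R_XY = {R_XY:.4f}, R_XZ = {R_XZ:.4f}, R_YZ = {R_YZ:.4f}")
```

Note that neither $s_x$ nor $s_z$ enters these three results; only the two variances of Y and the three first-group correlations are needed.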



154 


The Theory of Mental Tests [Chap. 12 

This completes the second case considered under 
the three-variable problem, the case in which selection was on one 
variable (X), and the standard deviation of the variable that was subject 
to incidental selection was known for both groups. This variable was 
designated Y, and equations were derived to express $S_X$, $S_Z$, $R_{XY}$, 
$R_{XZ}$, and $R_{YZ}$ in terms of the known quantities $r_{xz}$, $r_{yz}$, $r_{xy}$, 
$s_x$, $s_y$, $s_z$, and $S_Y$. The equations for $S_X$ and $R_{XY}$ were similar 
to those derived for the corresponding bivariate case in Chapter 11. 
However, the equations (involving variable Z) for $S_Z$, $R_{YZ}$, and $R_{XZ}$ 
were different from those previously derived for the bivariate case. 

One general caution must be noted in the application of the selection 
equations presented in Chapters 11 and 12. They are applicable only 
when the selection is made on the basis of one variable, and is made in 
such a way as not to alter the regression line of other variables on the 
explicit selection variable or the error made in estimating other variables 
from the explicit selection variable. If these assumptions hold, then 
the equations are perfect. However, as noted at the beginning of 
Chapter 11, in most practical situations the equations are likely not 
to apply. 

We should inspect the frequency distributions of the uncurtailed 
group and of the curtailed group for the variables involved 
before deciding upon the equations to use. If there is a sharp cut-off 
point, and everyone above was accepted while everyone below was 
rejected, it is clear that the variable in question may be regarded as the 
explicit selection variable, and the equations of Chapters 11 and 12 
used with confidence. If there is not a sharp cut-off point or a reasonably 
sharp cut-off point on one variable, the exact selection procedure must 
be more carefully investigated to determine if the type of selection used 
can reasonably be assumed not to have altered certain of the errors of 
estimate and regression slopes involved. If we can justify the 
assumptions indicated in equations 1 to 5, the selection equations apply. If, 
after study, we feel that the assumptions of equations 1 to 5 do not 
apply, there is no way of estimating the probable effects of 
selection. 
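
The favorable case of a sharp cut-off point can be illustrated by a small simulation. The Python sketch below is an added illustration, not an analysis from the text: it generates hypothetical scores for which the regressions on X are linear with constant error variance, admits everyone above a fixed cut on X, and applies equation 19 to the selected group. The sample size, cut-off point, weights, and seed are all assumptions.

```python
import math
import random

def sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / (len(a) * sd(a) * sd(b))

rng = random.Random(12345)                 # fixed seed so the run is repeatable
full = []
for _ in range(60000):
    xi = rng.gauss(0.0, 1.0)
    yi = 0.6 * xi + 0.8 * rng.gauss(0.0, 1.0)            # linear in X, constant error
    zi = 0.5 * xi + 0.3 * yi + 0.7 * rng.gauss(0.0, 1.0)
    full.append((xi, yi, zi))

# Sharp cut-off on X: everyone above 0.5 is accepted, everyone below rejected.
selected = [row for row in full if row[0] > 0.5]

X, Y, Z = (list(col) for col in zip(*full))
x, y, z = (list(col) for col in zip(*selected))

r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)
h2 = (sd(X) / sd(x)) ** 2                  # (S_X / s_x) squared
estimate = (r_yz - r_xy * r_xz + r_xy * r_xz * h2) / math.sqrt(
    (1 - r_xy ** 2 + r_xy ** 2 * h2) * (1 - r_xz ** 2 + r_xz ** 2 * h2))  # equation 19
actual = corr(Y, Z)

print(f"r_yz in selected group:      {r_yz:.3f}")
print(f"R_YZ estimated by eq. 19:    {estimate:.3f}")
print(f"R_YZ in the complete group:  {actual:.3f}")
```

Under these constructed conditions the corrected value should lie close to the correlation in the complete group, while the uncorrected value is appreciably lower. The sketch says nothing about selection that departs from a sharp cut-off, which is the situation the text warns against.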

As mentioned in Chapter 11, there is no adequate substitute for a 
well-designed experiment that makes the proper comparisons without 
any selection procedure. If we must work in a practical situation where 
selection is essential, every effort must be made to have the selection 
proceed in a specified manner, using definite critical score points, and 
rejecting every one below and accepting every one above these points. 
If such procedures are instituted and actually followed, the selection 
formulas can be used. If the selection procedure varies in accordance 
with “numerous practical considerations,” it is impossible to estimate 
the effect of selection. 

6. Summary 

These are the basic assumptions for the three-variable case. 

1. The slopes of the regressions of the incidental on the explicit 
selection variable are not altered by selection. This assumption is 
given in equations 

(1)  $r_{xy}\,\frac{s_y}{s_x} = R_{XY}\,\frac{S_Y}{S_X}$ 

and 

(2)  $r_{xz}\,\frac{s_z}{s_x} = R_{XZ}\,\frac{S_Z}{S_X}$. 

2. The error made in estimating either of the incidental selection 
variables from the explicit selection variable is not altered by selection. 
This assumption is given in equations 

(3)  $s_y^2(1 - r_{xy}^2) = S_Y^2(1 - R_{XY}^2)$ 

and 

(4)  $s_z^2(1 - r_{xz}^2) = S_Z^2(1 - R_{XZ}^2)$. 

3. The partial correlation between the two incidental selection 
variables is not altered by selection. This assumption is given in equation 

(5)  $\frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}} = \frac{R_{YZ} - R_{XY} R_{XZ}}{\sqrt{(1 - R_{XY}^2)(1 - R_{XZ}^2)}}$. 

It was assumed that we were always given complete information on 
one of the groups, and the convention that this group was represented 
by lower-case letters was adopted. That is, it was assumed that $r_{xy}$, 
$r_{xz}$, $r_{yz}$, $s_x$, $s_y$, and $s_z$ were always known. Unless complete information 
is available on one group, it is not possible to solve the problem. Two 
cases were then considered. 

1. The case was considered in which the standard deviation of the 
explicit selection variable ($S_X$) is known for the second group. The 
value of $S_Y$ was given in equation 9, and the value of $S_Z$ in equation 11. 
These equations are not repeated here because they are formally 
identical with equation 20 of Chapter 11 (on bivariate selection). Similarly, 
the value of $R_{XY}$ was given in equation 10, and of $R_{XZ}$ in equation 12. 
These equations are formally identical with equation 18, Chapter 11. 
The only new problem posed by the three-variable problem for this first 
case is estimating the correlation between the two incidental selection 
variables for the second group. This value is given by equation 

(19)  $R_{YZ} = \frac{r_{yz} - r_{xy} r_{xz} + r_{xy} r_{xz}\,\frac{S_X^2}{s_x^2}}{\sqrt{\left[1 - r_{xy}^2 + r_{xy}^2\,\frac{S_X^2}{s_x^2}\right]\left[1 - r_{xz}^2 + r_{xz}^2\,\frac{S_X^2}{s_x^2}\right]}}$. 


2. The case was also considered in which the standard deviation of 
one of the incidental selection variables was known for both groups. 
Without loss of generality, we may designate the known standard 
deviation $S_Y$ and use $S_Z$ for the unknown standard deviation of the other 
incidental selection variable. The formula for $S_X$ is given in equation 22, 
which is not repeated since it is identical with equation 12, Chapter 11. 
The variance of the other incidental selection variable ($S_Z^2$) is given by 
equation 

(28)  $S_Z^2 = s_z^2\,\frac{s_y^2 r_{xy}^2 - s_y^2 r_{xz}^2 + S_Y^2 r_{xz}^2}{s_y^2 r_{xy}^2}$. 

The value of $R_{XY}$ is given in equation 20, which is identical with 
equation 8, Chapter 11, and is not repeated here. The correlation $R_{XZ}$ is 
given by equation 

(31)  $R_{XZ} = r_{xz}\sqrt{\frac{S_Y^2 - s_y^2 + s_y^2 r_{xy}^2}{r_{xz}^2(S_Y^2 - s_y^2) + s_y^2 r_{xy}^2}}$. 

The correlation between the two incidental selection variables is given 
by equation 

(37)  $R_{YZ} = \frac{r_{xz}(S_Y^2 - s_y^2) + r_{xy} r_{yz} s_y^2}{S_Y\sqrt{r_{xz}^2(S_Y^2 - s_y^2) + s_y^2 r_{xy}^2}}$. 

These equations (31 and 37) have no counterpart in the bivariate case. 


Problems 

1. Assume that we are dealing with a case in which explicit selection occurs on the 
criterion, and the test is subject only to incidental selection (such as would occur if 
everyone taking the entrance test were admitted, but only those who received 
“passing” scores on the criterion were included in the selected group). 

(a) Write the equation showing the relationship between the entrance test 
variances and validities for the curtailed and the complete group. 

(b) Write the equation showing the relationship between the entrance test 
variances and reliabilities for both groups. 

2. Compare, from information in previous chapters and problems, the relationship 
between reliability and validity: 

(a) When the change is due to alteration of test length. 

(b) When the change is due to explicit curtailment of test variability. 

(c) When the change is due to explicit selection on the criterion resulting in 
incidental selection on the test in question. 

3. Test X is given to a group of applicants for admission to a certain college. The 
mean score for all applicants is 150, and the standard deviation is 30. After the 
students have been selected on the basis of this test and admitted to college, test Y 
is given. The two tests, X and Y, are correlated with grade average, which is used 
as a criterion, with the following results: 

Test X: mean 170; standard deviation 20; validity .63. 

Test Y: mean 160; standard deviation 25; validity .68. 

The correlation of X and Y is .80. 

According to these data, which test would be better to use for college admission? 

4. Prove that, if $r_{yz} = r_{xz}$ and $S_X < s_x$, then $R_{YZ} > R_{XZ}$. 

5. Prove that, if $r_{yz} = r_{xz}$ and $S_X > s_x$, then $R_{YZ} < R_{XZ}$. 

6. Study this reference: G. G. Thompson and S. L. Witryol (1946), “The 
relationship between intelligence and motor learning ability as measured by a high relief 
finger maze,” J. Psychol., 22, pages 237-246. 

(a) Comment on the use of the correction for restriction of range. 

(b) Present a correction for homogeneity which you consider appropriate for these 
data. 

(c) Give both the calculations and the argument for the correction you select. 


Correction for Multivariate Selection 
in the General Case 

1. Basic definitions and assumptions 

The equations for multivariate selection in the general case become 
almost prohibitively complex unless matrix algebra is used. Since only 
a few theorems of matrix algebra are used in this derivation, these 
theorems will be summarized here. Any set of numbers arranged in 
rows and columns is termed a matrix, and is designated by a single 
letter, such as M, N, A, B. In the derivations of this chapter four basic 
matrices are necessary. We have the matrix of test scores for the 
variables subject to explicit selection. For N individuals and A tests 
we may define 

(1)  $X_{NA} = \begin{bmatrix} X_{11} & \cdots & X_{1A} \\ \vdots & & \vdots \\ X_{N1} & \cdots & X_{NA} \end{bmatrix}$. 

The X's on the right-hand side of this equation are defined as deviation 
scores to simplify the formulas for variances and covariances. In 
defining the score matrix we may let each individual represent a row and each 
test a column, or vice versa. In the score matrices used here we shall 
arbitrarily let each row represent an individual, and each column a test. 
The matrix of test scores for the variables subject to incidental selection 
is defined by 

(2)  $Y_{NB} = \begin{bmatrix} Y_{11} & \cdots & Y_{1B} \\ \vdots & & \vdots \\ Y_{N1} & \cdots & Y_{NB} \end{bmatrix}$ 

for N individuals and B tests. Again the Y's on the right-hand side of 
the equation designate deviation scores. The X's are regarded as 
independent variables, and the Y's as dependent variables, which may 
be estimated by a weighted sum of the X's. Let us use $W_{X_g Y_b}$ to 
designate the weight to be applied to $X_g$ to predict $Y_b$. The complete matrix 
of weights will be defined by 

(3)  $W_{XY} = \begin{bmatrix} W_{X_1 Y_1} & \cdots & W_{X_1 Y_B} \\ \vdots & & \vdots \\ W_{X_A Y_1} & \cdots & W_{X_A Y_B} \end{bmatrix}$. 

The first column contains the weights to be applied to the independent 
variables $X_1$ to $X_A$ to predict $Y_1$. In general any column (which may 
be designated b) gives the weights to apply to the independent variables 
$X_1$ to $X_A$ to predict $Y_b$. If the predicted $Y_{ib}$ is indicated by $\hat{Y}_{ib}$, we have 

$\hat{Y}_{ib} = W_{X_1 Y_b} X_{i1} + W_{X_2 Y_b} X_{i2} + \cdots + W_{X_A Y_b} X_{iA} \quad (b = 1, \ldots, B)$. 

The weights are to be chosen so that $\sum_{i=1}^{N} (Y_{ib} - \hat{Y}_{ib})^2$ is a minimum. 

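It is also necessary to introduce a diagonal matrix with the terms along 
the principal diagonal each equal to 1/N, and all other terms equal to 
zero. Thus we have the square matrix 

(4)  $D_{GG} = \begin{bmatrix} 1/N & 0 & \cdots & 0 & 0 \\ 0 & 1/N & \cdots & 0 & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & \cdots & 1/N & 0 \\ 0 & 0 & \cdots & 0 & 1/N \end{bmatrix}$, 

where the subscript G designates the number of rows (columns) in the 
matrix and may equal either A or B. 

It should be noted that D is a scalar matrix (1/N times the identity 
matrix); hence for any two matrices P and Q for which PQ exists, 

DPQ = PDQ = PQD, 

where D in each position is taken to be of the order required for the 
product to exist. 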

Persons not acquainted with matrix algebra will need to study a 
text, for instance the first four chapters of Bôcher’s Higher Algebra (1907), 
or the first chapter of Thurstone (1935a) or (1947b). Those 
who have not worked recently with matrix algebra may find the following 
ten principles helpful in studying the derivations given here. 

1. The matrix sum M + N exists only if the number of rows in M 
is equal to the number of rows in N, and the number of columns 
in M is equal to the number of columns in N. 

2. Matrix addition follows the associative and the commutative laws. 

M + (N + O) = (M + N) + O, 

M + N = N + M. 

3. The matrix product MN exists only if the number of columns in 
M is equal to the number of rows in N. Matrix multiplication is 
associative but not, in general, commutative. 

M(NO) = (MN)O, 

MN ≠ NM. 

4. Matrix multiplication satisfies the distributive law for both 
premultiplication and postmultiplication. 

M(N + O) = MN + MO, and 

(N + O)M = NM + OM. 

5. Any square matrix M with a non-vanishing determinant 
($|M| \neq 0$) has an inverse, $M^{-1}$. A matrix premultiplied or 
postmultiplied by its inverse gives the identity matrix, I. 

$M^{-1}M = MM^{-1} = I$. 

6. The inverse of a transpose is equal to the transpose of the inverse. 

$(M^{-1})' = (M')^{-1}$. 

7. The transpose of a product is the product of transposes taken in 
reverse order. 

$(MNO)' = O'N'M'$. 

8. The transpose of a sum is the sum of the transposes. 

$(M + N)' = M' + N'$. 

9. The inverse of a product is the product of the inverses taken in 
reverse order. 

$(MNO)^{-1} = O^{-1}N^{-1}M^{-1}$. 

10. The inverse of a sum, $(M + N)^{-1}$, cannot in general be simplified. 


The solutions given in this chapter will be stated in terms of matrices 
of variances and covariances so that no explicit expressions for the 
correlation matrices will be needed. The variance-covariance matrix 
for the explicit selection variables, designated as $C_{XX}$, is given by 

(5)  $C_{XX} = X'_{AN} X_{NA} D_{AA}$. 

In like manner, we may write $C_{YY}$, the variance-covariance matrix for 
the variables subject to incidental selection, as 

(6)  $C_{YY} = Y'_{BN} Y_{NB} D_{BB}$. 

The (XY) covariance matrix is designated by $C_{XY}$ and written 

(7)  $C_{XY} = X'_{AN} Y_{NB} D_{BB}$. 

The variables subject to incidental selection (Y-variables) are regarded 
as being estimated by linear combinations of the explicit selection 
variables (X-variables). We designate the matrix of predicted Y-values 
by $\hat{Y}_{NB}$ and write 

(8)  $\hat{Y}_{NB} = X_{NA} W_{XY}$. 

The matrix of errors of prediction is given by the difference between the 
actual Y-values and the predicted Y-values. This matrix is designated 
E and written 

(9)  $E = Y_{NB} - \hat{Y}_{NB}$. 

The variance-covariance matrix of the errors of prediction, designated 
$C_{EE}$, is given by 

(10)  $C_{EE} = E'E\,D_{BB}$. 
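
Equations 5 to 7 amount to forming cross products of deviation scores and dividing by N, since postmultiplying by the scalar matrix D is the same as multiplying every element by 1/N. The Python sketch below is an added illustration using plain lists; the helper names and the raw scores (N = 4 individuals, A = 2, B = 1) are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(p, q):
    """Matrix product of two lists of rows (columns of p must match rows of q)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]

def scale(m, c):
    """Multiply every element by c; equivalent to multiplying by a scalar matrix."""
    return [[c * v for v in row] for row in m]

def deviations(scores):
    """Subtract each column (test) mean so that entries are deviation scores."""
    n = len(scores)
    means = [sum(col) / n for col in zip(*scores)]
    return [[v - m for v, m in zip(row, means)] for row in scores]

# Hypothetical raw scores: each row is an individual, each column a test.
X = deviations([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
Y = deviations([[1.0], [3.0], [2.0], [6.0]])
N = len(X)

C_xx = scale(matmul(transpose(X), X), 1.0 / N)   # equation 5
C_yy = scale(matmul(transpose(Y), Y), 1.0 / N)   # equation 6
C_xy = scale(matmul(transpose(X), Y), 1.0 / N)   # equation 7

print("C_xx =", C_xx)
print("C_yy =", C_yy)
print("C_xy =", C_xy)
```

The divisor here is N, as in equation 4, rather than N - 1.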

Equations 1 to 10, written in upper-case letters, will be used to 
designate the group for which complete information is not available on 
all variables. Corresponding equations with lower-case letters will be 
used to designate the group for which complete information is available. 
We have, corresponding to equations 5 to 10, for the group on which 
complete information is available, 

(11)  $c_{xx} = x'_{an} x_{na} d_{aa}$, 

(12)  $c_{yy} = y'_{bn} y_{nb} d_{bb}$, 

(13)  $c_{xy} = x'_{an} y_{nb} d_{bb}$, 

(14)  $\hat{y}_{nb} = x_{na} w_{xy}$, 

(15)  $e = y_{nb} - \hat{y}_{nb}$, 

and 

(16)  $c_{ee} = e'e\,d_{bb}$. 




It should be noted that the number of cases in the two groups need 
not be the same so that, in general, $n \neq N$, $d_{aa} \neq D_{AA}$, and $d_{bb} \neq D_{BB}$. 
However, the number of explicit selection variables must be the same 
in both groups, and the number of incidental selection variables must 
be the same, so that a = A and b = B. 

The group designated by upper-case letters is assumed to be similar 
to the one designated by lower-case letters in that the gross score weights 
applied to the explicit selection variables to predict the incidental 
selection variables are the same for both groups. This assumption is the 
generalization of the assumption of equal slopes given in equations 1 
and 2 of Chapter 12 for the three-variable case in univariate selection. 
In matrix notation this assumption is written 

(17)  $W_{XY} = w_{xy}$. 

In addition, it is assumed that the errors of estimate are the same for 
both groups. (See equations 3 and 4, Chapter 12.) It is also assumed, 
as in equation 5 of Chapter 12, that the correlations among the Y's when 
the X's are partialled out are the same as the correlations among the y's 
when the x's are partialled out. These two assumptions are written 
in the matrix equation 

(18)  $C_{EE} = c_{ee}$. 

The diagonal terms of these matrices are the squares of the errors of 
estimate; assuming them equal corresponds to the generalization of 
assumptions of equations 3 and 4 of Chapter 12. The non-diagonal 
terms of equation 18 are the partial covariances. Assuming them equal 
corresponds to the generalization of equation 5, Chapter 12. 

Since the basic assumptions given in equations 17 and 18 involve
W_{XY} and C_{EE}, we shall turn to the problem of expressing these matrices
in terms of the basic variance-covariance matrices C_{XX}, C_{YY}, and C_{XY}.
The error made in predicting any given set of Y-values, such as Y_{ib}
(where i = 1 ... N, and b is some specified value that may be any one
of the values from 1 to B), is indicated by subtracting the summed
product (\sum_{g=1}^{A} W_{gb} X_{ig}) from Y_{ib}, squaring these differences,
summing over i for a given value of b, and dividing by N. This gives an
error variance term that is one of the diagonal terms of the matrix C_{EE}.
In order to deal solely with these diagonal terms, equations 19 to 21,
inclusive, are in the usual algebraic summation notation; equation 22
returns to matrix notation. Let us designate a typical diagonal term by
E_b^2 (where b = 1 ... B). We then have

(19)    E_b^2 = \frac{1}{N} \sum_{i=1}^{N} \left( Y_{ib} - \sum_{g=1}^{A} W_{gb} X_{ig} \right)^2    (b = 1 ... B).

The multiple correlation problem is to select the weights (W_{gb}) so as to
minimize the value of the error variance E_b^2. For a given value of
E_b^2, we differentiate with respect to W_{hb} (h = 1 ... A) and set the
derivative equal to zero, obtaining

(20)    \frac{\partial E_b^2}{\partial W_{hb}} = -\frac{2}{N} \sum_{i=1}^{N} \left( Y_{ib} - \sum_{g=1}^{A} W_{gb} X_{ig} \right) X_{ih} = 0.

Removing parentheses and changing the order of summation in the
second term, we have

(21)    \frac{\partial E_b^2}{\partial W_{hb}} = -\frac{2}{N} \left[ \sum_{i=1}^{N} Y_{ib} X_{ih} - \sum_{g=1}^{A} W_{gb} \sum_{i=1}^{N} X_{ig} X_{ih} \right] = 0.

For a single value of h, equation 21 states a single condition for
minimizing a given term in the diagonal of C_{EE}. If we let h take in turn
each of the values from 1 to A, while b remains fixed, equation 21
indicates a set of A equations which specifies the weights necessary to
minimize a given term (E_b^2) in the diagonal of C_{EE}. If b now takes in
turn each of the values from 1 to B, equation 21 indicates a set of AB
equations which specifies the weights necessary to minimize in turn
each of the terms (E_b^2) (b = 1 ... B) in the diagonal of C_{EE}. When
h = 1 ... A and b = 1 ... B, the first term of equation 21 is identical
with the matrix given in equation 7. The last term is in the general
case identical with the product of equations 5 and 3, or C_{XX}W_{XY}.
Putting equations 21 into matrix notation, we have

(22)    C_{XY} - C_{XX} W_{XY} = 0.

From equation 22 we obtain the solution for the matrix of weights
(W_{XY}). Transferring C_{XX}W_{XY} to the other side of the equation, and
premultiplying both sides by the inverse of C_{XX}, we have

(23)    C_{XX}^{-1} C_{XY} = C_{XX}^{-1} C_{XX} W_{XY} = W_{XY}.

A corresponding equation can be derived for the group on which
complete information is available. Substituting lower-case letters in
equation 23 gives this equation,

(24)    c_{xx}^{-1} c_{xy} = w_{xy}.

Since the transpose of both equations 23 and 24 will also be needed, we
shall write these explicitly as

(25)    W'_{YX} = C'_{YX} C_{XX}^{-1}

and

(26)    w'_{yx} = c'_{yx} c_{xx}^{-1}.

Since C_{XX} is a symmetric matrix, its inverse is also symmetric; hence
(C_{XX}^{-1})' = C_{XX}^{-1}, and correspondingly, since c_{xx} is symmetric,
(c_{xx}^{-1})' = c_{xx}^{-1}.

Equations 23 and 24 or equations 25 and 26 give the best weights to
use for the X's (or for the x's) to predict the Y's (y's). These equations
are identical with the equations for the best weights used in multiple 
correlation. (See Chapter 20, equations 52 and 53.) 
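
As a numerical illustration of equation 23, the following minimal Python sketch solves for the weights from a variance-covariance matrix. The matrices C_XX and C_XY hold hypothetical values, and the helper names inv2 and matmul are assumptions introduced only for this example; none of them come from the text.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no unique weights exist")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(p, q):
    """Product of nested-list matrices p (r x s) and q (s x t)."""
    return [[sum(p[i][t] * q[t][j] for t in range(len(q)))
             for j in range(len(q[0]))]
            for i in range(len(p))]


# Hypothetical covariances: two predictors (A = 2), one criterion (B = 1).
C_XX = [[4.0, 2.0], [2.0, 9.0]]
C_XY = [[3.0], [4.0]]

# Equation 23: W_XY = C_XX^{-1} C_XY.
W_XY = matmul(inv2(C_XX), C_XY)

# Equation 22 (the normal equations): C_XY - C_XX W_XY should be zero.
fitted = matmul(C_XX, W_XY)
residual = [C_XY[i][0] - fitted[i][0] for i in range(2)]
print(W_XY, residual)
```

Explicit inversion is used here only because it mirrors the algebra of equation 23; for larger problems a linear solver would normally be preferred.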

We now turn to the evaluation of the variance-covariance matrix of
the errors of prediction (C_{EE}) in terms of the matrices C_{XX}, C_{YY}, and
C_{XY}, which can be obtained from the data on observed scores.
Substituting equation 9 in equation 10 gives

(27)    C_{EE} = (Y_{NB} - \hat{Y}_{NB})'(Y_{NB} - \hat{Y}_{NB}) D_{BB}.

Removing parentheses and expanding, we have

(28)    C_{EE} = Y'_{BN} Y_{NB} D_{BB} - Y'_{BN} \hat{Y}_{NB} D_{BB}
                 - \hat{Y}'_{BN} Y_{NB} D_{BB} + \hat{Y}'_{BN} \hat{Y}_{NB} D_{BB}.

Substituting equations 6 and 8 in equation 28, we have

(29)    C_{EE} = C_{YY} - Y'_{BN} X_{NA} W_{XY} D_{BB} - W'_{YX} X'_{AN} Y_{NB} D_{BB}
                 + W'_{YX} X'_{AN} X_{NA} W_{XY} D_{BB}.

Substituting equations 5, 7, 23, and 25 in equation 29, and noting that
D is a scalar, we obtain

(30)    C_{EE} = C_{YY} - C'_{YX} C_{XX}^{-1} C_{XY} - C'_{YX} C_{XX}^{-1} C_{XY}
                 + C'_{YX} C_{XX}^{-1} C_{XX} C_{XX}^{-1} C_{XY}.

Since a matrix times its inverse is the identity matrix, the last two terms
cancel each other, leaving

(31)    C_{EE} = C_{YY} - C'_{YX} C_{XX}^{-1} C_{XY}.

Equation 31 may be written in another form by substituting equation 23
in it. Alternatively, equation 25 may be substituted in equation 31.
Making these substitutions gives

(32)    C_{EE} = C_{YY} - C'_{YX} W_{XY} = C_{YY} - W'_{YX} C_{XY}.

A corresponding set of equations can be derived for the group on which
complete information is available. Substituting lower-case letters in
equations 31 and 32 gives these equations as

(33)    c_{ee} = c_{yy} - c'_{yx} c_{xx}^{-1} c_{xy}

and

(34)    c_{ee} = c_{yy} - c'_{yx} w_{xy} = c_{yy} - w'_{yx} c_{xy}.
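
The two forms of equation 32 can be compared numerically. The sketch below is illustrative only: the covariance values are hypothetical, and the helpers inv2, matmul, transpose, and sub are names assumed for this example. It uses two explicit selection variables and two incidental selection variables, so that C_EE has off-diagonal terms (the partial covariances).

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(p, q):
    """Product of nested-list matrices."""
    return [[sum(p[i][t] * q[t][j] for t in range(len(q)))
             for j in range(len(q[0]))]
            for i in range(len(p))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sub(p, q):
    """Element-wise difference p - q."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(p, q)]


# Hypothetical variance-covariance matrices (A = 2, B = 2).
C_XX = [[4.0, 2.0], [2.0, 9.0]]
C_XY = [[3.0, 1.0], [4.0, 2.0]]
C_YY = [[5.0, 1.5], [1.5, 3.0]]

W_XY = matmul(inv2(C_XX), C_XY)                       # equation 23
C_EE = sub(C_YY, matmul(transpose(C_XY), W_XY))       # equation 32, first form
C_EE_alt = sub(C_YY, matmul(transpose(W_XY), C_XY))   # equation 32, second form
print(C_EE)
```

The diagonal entries of C_EE are the squared errors of estimate, and the off-diagonal entry is the partial covariance of the two Y's with the X's held constant.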

2. Complete information available for all the explicit 
selection variables 

We now consider the case in which the variance-covariance matrix
of both groups is known for the variables subject to explicit selection.
In this case C_{XX} is given. In general it is assumed that c_{xx}, c_{yy}, and
c_{xy} are known. The problem of this section, then, is to express the
unknown variance-covariance matrix for the incidental selection
variables C_{YY} and the covariance matrix C_{XY} as functions of the known
terms C_{XX}, c_{xx}, c_{yy}, and c_{xy}. Substituting equations 23 and 24 in
equation 17 gives

(35)    C_{XX}^{-1} C_{XY} = c_{xx}^{-1} c_{xy}.

Premultiplying both sides by C_{XX} and noting that a matrix times its
inverse is equal to the identity matrix, we have

(36)    C_{XY} = C_{XX} c_{xx}^{-1} c_{xy}.

Using equation 24, we may write equation 36 in the alternative form,

(37)    C_{XY} = C_{XX} w_{xy}.

Since the transpose of C_{XY} will be needed in solving for C_{YY}, we may
write it from equations 36 and 37 as

(38)    C'_{YX} = c'_{yx} c_{xx}^{-1} C_{XX} = w'_{yx} C_{XX}.

Equation 36 or 37 gives C_{XY} entirely in terms of known
quantities when complete information is available for the explicit
selection variable (X).

This solution for C_{XY} in terms of the explicit selection variables is given
by Pearson (1903a), Aitken (1934), and Burt (1943) and (1944).

To solve for C_{YY}, substitute equations 32 and 34 in equation 18,
obtaining

(39)    C_{YY} - C'_{YX} W_{XY} = c_{yy} - c'_{yx} w_{xy}.

Using equation 17 and solving explicitly for C_{YY}, we obtain

(40)    C_{YY} = c_{yy} + (C'_{YX} - c'_{yx}) w_{xy}.

Substituting equations 24 and 38 in equation 40, we have

(41)    C_{YY} = c_{yy} + c'_{yx} c_{xx}^{-1} C_{XX} c_{xx}^{-1} c_{xy} - c'_{yx} c_{xx}^{-1} c_{xy}.

Since C_{YY} is symmetric, it is equal to its transpose. Thus, from equation
40, we have an alternative form,

(42)    C_{YY} = c_{yy} + w'_{yx}(C_{XY} - c_{xy}).

This equation is also given by Pearson (1903a), Aitken (1934), and
Burt (1943) and (1944).

Equation 41 gives C_{YY} entirely in terms of known quantities.
Equation 40 or 42 gives C_{YY} if C_{XY} is taken from equation
36 or 38.

These equations complete the solution for the case in which complete 
information is available for the variables subject to explicit selection. 
Equations 36, 37, and 38 are generalizations of equation 18, Chapter 11, 
and equations 10 and 12 of Chapter 12. Equations 40, 41, and 42 are 
generalizations of equation 20, Chapter 11, and equations 9 and 11 of 
Chapter 12. 
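
The solution for this case (equations 24, 37, and 42) can be traced with a small numerical sketch. All values below are hypothetical, and the helper names are assumptions made for the example. Lower-case names refer to the group with complete information, upper-case names to the group for which only C_XX is known.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(p, q):
    """Product of nested-list matrices."""
    return [[sum(p[i][t] * q[t][j] for t in range(len(q)))
             for j in range(len(q[0]))]
            for i in range(len(p))]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Group with complete information (two x's, one y); hypothetical values.
c_xx = [[4.0, 2.0], [2.0, 9.0]]
c_xy = [[3.0], [4.0]]
c_yy = [[5.0]]
# Group for which only the explicit selection variables are known.
C_XX = [[8.0, 4.0], [4.0, 12.0]]

w_xy = matmul(inv2(c_xx), c_xy)                       # equation 24
C_XY = matmul(C_XX, w_xy)                             # equation 37
diff = [[C_XY[i][0] - c_xy[i][0]] for i in range(2)]
C_YY = c_yy[0][0] + matmul(transpose(w_xy), diff)[0][0]   # equation 42

# Assumption 18: the error variance should be the same in both groups.
c_ee = c_yy[0][0] - matmul(transpose(c_xy), w_xy)[0][0]
C_EE = C_YY - matmul(transpose(C_XY), w_xy)[0][0]
print(C_XY, C_YY, c_ee, C_EE)
```

The final two quantities agree, which is the check that the estimated C_YY is consistent with the assumption of equal errors of estimate.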

3. Complete information available for some of the incidental 
selection variables 

For the generalized treatment of this case, it is necessary to distinguish
between two categories of variables subject to incidental selection. We
will let Y (or y) designate only those incidental selection variables for
which complete information is available. That is, C_{YY} is known in
addition to c_{yy}. Incidental selection variables for which incomplete
information is available will be designated by Z (or z). For these only
c_{zz} is known. It thus becomes necessary to express five unknowns,
C_{XX}, C_{XY}, C_{ZZ}, C_{XZ}, and C_{YZ} in terms of the known quantities C_{YY},
c_{yy}, c_{xx}, c_{zz}, c_{xy}, c_{xz}, and c_{yz}.

The solution for all terms involving Z will be postponed. Let us
first consider the solution for C_{XX} and C_{XY} in terms of the known
quantities C_{YY}, c_{xx}, c_{xy}, and c_{yy}.
Substituting equations 32, 34, and 17 in equation 18, we have

(43)    C_{YY} - C'_{YX} w_{xy} = c_{yy} - c'_{yx} w_{xy}.

Transferring the known terms to one side of the equation, we have

(44)    C'_{YX} w_{xy} = C_{YY} - c_{yy} + c'_{yx} w_{xy}.

Postmultiplying both sides by w_{yx}^{-1}, the inverse of w_{xy}, gives the
solution for the transpose of C_{XY} as

(45)    C'_{YX} = (C_{YY} - c_{yy}) w_{yx}^{-1} + c'_{yx}.

Taking the transpose of both sides gives

(46)    C_{XY} = w'^{-1}_{xy}(C_{YY} - c_{yy}) + c_{xy}.

Equation 46 gives C_{XY} in terms of known quantities, if information
on both groups is available for the incidental selection
variables (Y).

Equation 46 is a generalization of the solution for R_{XY} given in equation
8, Chapter 11, and equation 20, Chapter 12.

It should be noticed that this solution assumes that the inverse of
w_{xy} exists. Since the inverse of a product is the product of the inverses
taken in the reverse order, equation 24 gives

(47)    w_{yx}^{-1} = c_{yx}^{-1} c_{xx}.

That is, w_{xy} has an inverse if c_{xy} has an inverse. In other words, the
variables y for which complete information is available must be at least
equal in number to the explicit selection variables, and must not be
linearly dependent on the explicit selection variables (x). If these
two conditions are met, c_{yx}^{-1} will exist, w_{yx}^{-1} will exist, and the solution
given by equation 46 will be meaningful. If the number of incidental
selection variables for which complete information is available is less
than the number of explicit selection variables, or if these incidental
selection variables are linearly dependent on the explicit selection
variables, the information available is not sufficient for an exact solution
of the problem. Given such conditions, various sets of values for C_{XY}
would be possible solutions. If the number of incidental selection
variables for which complete information is available is greater than the
number of explicit selection variables, C_{XY} is overdetermined. The
solution for C_{XY} would be in terms of least squares or some other
maximum likelihood procedure. It would also be desirable to devise some
method for assessing variation of the individual solutions from the least
squares solution. If this variation is small, the least squares solution
could be accepted. If this variation is large, it probably indicates that
unknown factors are entering into the selection procedures so that a
further effort must be made to secure a more accurate description of
selection procedures before proceeding with any corrections for selection.

In this book we shall consider only the case in which adequate
information is available, and the solution is exact. That is, the
incidental selection variables for which complete information is
available (Y) are equal in number to the explicit selection variables
and are not linearly dependent on them. In this case equation 46 is
the solution for C_{XY}.

The solution for C_{XX} may be obtained by postmultiplying both sides
of equation 37 by w_{yx}^{-1}, obtaining

(48)    C_{XX} = C_{XY} w_{yx}^{-1}.

Using equations 46 and 24, we may write C_{XX} as

(49)    C_{XX} = w'^{-1}_{xy}(C_{YY} - c_{yy}) w_{yx}^{-1} + c_{xx}.

Equation 49 gives C_{XX} in terms of known quantities when
complete information is available for the incidental selection
variables (Y).

It should be noticed that the solution for C_{XX} also involves only the
assumption that the inverse of w_{xy} or of c_{xy} exists. Equation 48 or 49
is a generalization of the solution for S_X given in equation 12, Chapter 11,
and in equation 22, Chapter 12.
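
Equations 46 and 49 can be checked with a round trip. The sketch below is illustrative only, with hypothetical values and helper names: it first manufactures a consistent C_YY from an assumed "true" C_XX using equations 37 and 42, then pretends that only C_YY is known and recovers C_XY and C_XX. It uses two x's and two y's, so that w_xy is square as the exact solution requires.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; the exact solution does not exist")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(p, q):
    return [[sum(p[i][t] * q[t][j] for t in range(len(q)))
             for j in range(len(q[0]))]
            for i in range(len(p))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(p, q):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(p, q)]


def sub(p, q):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(p, q)]


# Group with complete information; hypothetical values.
c_xx = [[4.0, 2.0], [2.0, 9.0]]
c_xy = [[3.0, 1.0], [4.0, 2.0]]
c_yy = [[5.0, 1.5], [1.5, 3.0]]
w_xy = matmul(inv2(c_xx), c_xy)                          # equation 24

# Forward step, used only to construct a consistent C_YY for the demo.
C_XX_true = [[8.0, 4.0], [4.0, 12.0]]
C_XY_true = matmul(C_XX_true, w_xy)                      # equation 37
C_YY = add(c_yy, matmul(transpose(w_xy), sub(C_XY_true, c_xy)))  # equation 42

# Section 3: only C_YY is treated as known for the second group.
w_inv = inv2(w_xy)             # w^{-1}_{yx}, the inverse of w_xy
w_inv_t = transpose(w_inv)     # w'^{-1}_{xy}
gain = sub(C_YY, c_yy)
C_XY = add(matmul(w_inv_t, gain), c_xy)                  # equation 46
C_XX = add(matmul(matmul(w_inv_t, gain), w_inv), c_xx)   # equation 49
print(C_XY, C_XX)
```

If w_xy were singular, inv2 would raise an error, which corresponds to the case in which the y's are linearly dependent and no exact solution exists.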

Thus we have C_{XX} and C_{XY} in terms of known quantities. The
equations of the preceding section can then be used to give C_{ZZ}, C_{XZ},
and C_{YZ}. Substituting Z for Y and z for y in equation 37, we have

(50)    C_{XZ} = C_{XX} w_{xz}.

Substituting equation 49 in equation 50, we have

(51)    C_{XZ} = w'^{-1}_{xy}(C_{YY} - c_{yy}) w_{yx}^{-1} w_{xz} + c_{xx} w_{xz}.

Using equations 24 and 47, we have

(52)    C_{XZ} = w'^{-1}_{xy}(C_{YY} - c_{yy}) c_{yx}^{-1} c_{xz} + c_{xz}.

Equation 52 gives the solution for C_{XZ} in terms of known
quantities when complete information is available for the
incidental selection variables (Y).

Equation 52 is a generalization of equation 31 of Chapter 12.

In a corresponding manner we may write the solution for C_{ZZ} by
substituting Z for Y and z for y in equation 42, obtaining

(53)    C_{ZZ} = c_{zz} + w'_{zx}(C_{XZ} - c_{xz}).

Substituting equation 52 in equation 53, we have

(54)    C_{ZZ} = c_{zz} + w'_{zx} w'^{-1}_{xy}(C_{YY} - c_{yy}) c_{yx}^{-1} c_{xz}.

Using equation 26 and the rule that the inverse of a product is the
product of the inverses in reverse order, we may write equation 54 in
an alternative form as

(55)    C_{ZZ} = c_{zz} + c'_{zx} c'^{-1}_{xy}(C_{YY} - c_{yy}) c_{yx}^{-1} c_{xz}.

Equations 54 and 55 give C_{ZZ} in terms of known quantities
when complete information is available on some of the
incidental selection variables.

Equation 55 is the generalization of equation 28, Chapter 12.

The value of C_{YZ} may be found from the assumption that the partial
correlations between Y and Z (X held constant) are equal to the
partial correlations between y and z (x held constant). This matrix
may be written by substituting Z for the second Y in equation 32, giving

(56)    C_{E_Y E_Z} = C_{YZ} - C'_{YX} W_{XZ} = C_{YZ} - W'_{YX} C_{XZ}.

In like manner, using the lower-case letters, we have

(57)    c_{e_y e_z} = c_{yz} - c'_{yx} w_{xz} = c_{yz} - w'_{yx} c_{xz}.

If we assume that C_{E_Y E_Z} = c_{e_y e_z}, use assumption equation 17, and solve
for C_{YZ} from equations 56 and 57, we have

(58)    C_{YZ} = c_{yz} + w'_{yx}(C_{XZ} - c_{xz}).

Substituting equation 52 in equation 58, we have

(59)    C_{YZ} = c_{yz} + w'_{yx} w'^{-1}_{xy}(C_{YY} - c_{yy}) c_{yx}^{-1} c_{xz},

which simplifies to

(60)    C_{YZ} = c_{yz} + (C_{YY} - c_{yy}) c_{yx}^{-1} c_{xz}.

Equation 60 gives C_{YZ} in terms of known quantities.

Equation 60 is a generalization of the expression for R_{YZ} given in
equation 37 of Chapter 12.
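
For a quick arithmetic check of equations 49, 52, 55, and 60, it may help to reduce everything to the scalar case of one X, one Y, and one Z, where each matrix is a single number and each inverse is a reciprocal. The sketch below makes that simplifying assumption; all numerical values are hypothetical.

```python
# Known for the group with complete information (lower case), plus C_YY
# for the other group. All values are hypothetical.
c_xx, c_yy, c_zz = 4.0, 5.0, 6.0
c_xy, c_xz, c_yz = 3.0, 2.0, 2.5
C_YY = 7.0

w_xy = c_xy / c_xx          # equation 24
w_xz = c_xz / c_xx          # equation 24 with z substituted for y
gain = C_YY - c_yy          # the term (C_YY - c_yy)

C_XX = gain / w_xy ** 2 + c_xx                 # equation 49
C_XZ = gain * c_xz / (w_xy * c_xy) + c_xz      # equation 52
C_ZZ = c_zz + (c_xz / c_xy) ** 2 * gain        # equation 55
C_YZ = c_yz + gain * c_xz / c_xy               # equation 60

# The shorter forms should give the same values.
C_XZ_short = C_XX * w_xz                       # equation 50
C_ZZ_short = c_zz + w_xz * (C_XZ - c_xz)       # equation 53
C_YZ_short = c_yz + w_xy * (C_XZ - c_xz)       # equation 58
print(C_XX, C_XZ, C_ZZ, C_YZ)
```

Agreement between the long and short forms is only an arithmetic check on the algebra; it does not test the assumptions of equal regressions and equal errors of estimate.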

This completes the solution for the general case in which complete 
information is available for some of the variables subject to incidental 
selection. An exact solution is possible only when the number of
incidental selection variables for which complete information is available
is at least equal to the number of explicit selection variables, and when
there is not complete linear dependence between these incidental
selection variables and the explicit selection variables. The solution has
been given for the five matrices C_{XX}, C_{ZZ}, C_{XZ}, C_{XY}, and C_{YZ} in terms
of the known quantities C_{YY}, c_{yy}, c_{xx}, c_{zz}, c_{xy}, c_{xz}, and c_{yz}.

It is also possible to solve a more general case in which complete
information is assumed to be available for some of the incidental selection 
variables and some of the explicit selection variables. The detailed 
solution for this case will not be given. It can be solved by using the 
methods of this section to solve for the remaining explicit selection 
variables, and then using the methods in this and the preceding section 
to complete the solution. 

4. Summary 

In dealing with the general case of multivariate selection, X (or x) 
was used to designate the variables subject to explicit selection, and 
Y (or y) to designate the variables subject to incidental selection. 
Lower-case letters are used to designate the group for which complete 
information is available. Thus the variance-covariance matrices c_{xx}
and c_{yy} are known, as well as the covariance matrix c_{xy}. Upper-case
letters are used to designate the group for which only one
variance-covariance matrix is known. That is, either C_{XX} or C_{YY} is known. The
problem is to solve for the unknown variance-covariance matrix and
for the covariance matrix C_{XY}.

It is assumed that the properties of the regression of Y on X are 
identical with the properties of the regression of y on x. This means 
that the gross score weights are equal for both groups, that is, 

(17)    W_{XY} = w_{xy}.

Since these are the least squares weights, equation 17 may be rewritten as

(35)    C_{XX}^{-1} C_{XY} = c_{xx}^{-1} c_{xy}.

The assumption of identical properties of the two regressions also means
that the error made in estimating Y from X is the same as the error
made in estimating y from x; and that the correlations among the Y's
with the X's partialled out are the same as the correlations among the
y's with the x's partialled out. These assumptions are given in the
equation

(18)    C_{EE} = c_{ee}.

Rewriting equation 18 explicitly for the least squares case, we have

(39)    C_{YY} - C'_{YX} W_{XY} = c_{yy} - c'_{yx} w_{xy}.

For the case in which complete information is available on all the
explicit selection variables, C_{XX} is known. The two unknowns C_{XY}
and C_{YY} are given by

(37)    C_{XY} = C_{XX} w_{xy},

where

(24)    w_{xy} = c_{xx}^{-1} c_{xy},

and by

(42)    C_{YY} = c_{yy} + w'_{yx}(C_{XY} - c_{xy}).


For the case in which complete information is available on some of 
the incidental selection variables, it is necessary to distinguish between 
the incidental selection variables for which complete information is 
available (designated by Y or y) and the incidental selection variables 
that are known for only one group (designated by Z or z). The solution
for the unknowns (C_{XY}, C_{XX}, C_{XZ}, C_{ZZ}, and C_{YZ}) can be indicated in
terms of the known quantities (C_{YY}, c_{yy}, c_{xx}, c_{zz}, c_{xy}, c_{xz}, and c_{yz}). It
should be noted that all these solutions are dependent upon the existence
of the inverse of w_{xy}, which is equivalent to the existence of the inverse
of c_{xy}, since

(47)    w_{yx}^{-1} = c_{yx}^{-1} c_{xx}.

That is, we are not considering the cases in which the number of y's
is greater or less than the number of x's; nor are we considering the case
in which the y's are linearly dependent upon the x's. For all the
following solutions, it is assumed that c_{xy} or w_{xy} is a square matrix with its
rank equal to its order.

For this case we have 


(46)    C_{XY} = w'^{-1}_{xy}(C_{YY} - c_{yy}) + c_{xy},

where w'^{-1}_{xy} is given as the transpose of equation 47. Using equations
46 and 47, we may write

(48)    C_{XX} = C_{XY} w_{yx}^{-1},

(50)    C_{XZ} = C_{XX} w_{xz},

(53)    C_{ZZ} = c_{zz} + w'_{zx}(C_{XZ} - c_{xz}),

and

(58)    C_{YZ} = c_{yz} + w'_{yx}(C_{XZ} - c_{xz})

or

(60)    C_{YZ} = c_{yz} + (C_{YY} - c_{yy}) c_{yx}^{-1} c_{xz}.

For these equations the value of w_{xz} may be found by substituting z
for y in equation 24.

Problems

1. Express the weights (W_{XY}) of equation 3 as functions of the correlation matrices
R_{XY} and R_{XX}. Show all steps in the derivation and all the necessary assumptions
and definitions.

2. For the case in which complete information is available on the variables (X)
subject to explicit selection, express the correlation matrices R_{XY} and R_{YY} as
functions of the correlation matrices R_{XX}, r_{xx}, r_{yy}, and r_{xy}. Show all the steps in the
derivation and all the necessary assumptions and definitions.

3. For the case in which complete information is available on the variables (Y),
subject to incidental selection, express the correlation matrices R_{XY} and R_{XX} as
functions of the correlation matrices R_{YY}, r_{xx}, r_{yy}, and r_{xy}. Show all the steps in the
derivation and all the necessary assumptions and definitions.



14 

A Statistical Criterion 
for Parallel Tests 


1. Introduction 

As indicated in Chapters 2, 3, 6, 7, 8, and 9, parallel tests are tests 
that have equal means, equal variances, and equal intercorrelations. 
For any given set of experimental data, where the parallel forms of a 
test are given to a single group, there will be, even under the best
conditions, some small sampling differences. To be certain that the tests
may be regarded as parallel, it is necessary to have some statistical 
criterion that will show whether or not the means may be regarded as 
samples from a population in which the means are identical, the variances 
may be regarded as samples from a population in which variances are 
identical, and the intercorrelations may be regarded as samples from a 
population in which the correlations are identical. Such a test has 
recently been provided by Dr. S. S. Wilks (Wilks, 1946). Since two 
parallel forms have only one intercorrelation, it is possible in this case 
to check only for equality of means and of variances; hence we must 
consider the case of three or more parallel tests in order to demonstrate 
the statistical criterion for parallel tests. 

We shall not give the derivation of this statistical criterion, which
may be found in the foregoing reference, but shall simply indicate the
proper statistic to compute, and give the table for evaluating the
significance of this statistic in the large sample case. 1

In addition to equal means, variances, and reliabilities, parallel tests
should have approximately equal validities for any given criterion.
David Votaw has recently solved this problem as a part of his PhD
dissertation in mathematical statistics at Princeton University (Votaw,
1947, 1948).

It should also be noted that, in addition to satisfying statistical
criteria for being parallel, the tests should contain items dealing with the
same subject matter, items of the same format, etc. In other words, the
tests should be parallel as far as psychological judgment is concerned.
At present, this criterion of psychological judgment is usually the only
one used. The emphasis in this chapter is on the statistical criteria
that the tests must satisfy in addition to the psychological criteria.

1 This material on tests of compound symmetry is given here with
acknowledgements to and the permission of the editors of The Annals of
Mathematical Statistics.


2. Basic statistics needed to compute the statistical criterion 
for parallel tests 

Let us assume that k parallel forms of a test have been given to a 
population of N individuals. Assume further that the usual statistics 
have been obtained for such a set of data. These statistics are the
mean of each test (M_g), the variance of each test (s_g^2), and the
covariances for each pair of tests (c_{gh}).

It is then necessary to compute the following four quantities:

D, the determinant of the variance-covariance matrix, 1

(1)    \bar{s}^2 = \frac{\sum_{g=1}^{k} s_g^2}{k}, the average variance,

(2)    \bar{r} = \frac{\sum_{g \ne h} c_{gh}}{k(k - 1)\bar{s}^2}, the average correlation, computed as the average
       covariance divided by the average variance,

and

(3)    v = \frac{\sum_{g=1}^{k} (M_g - \bar{M})^2}{k - 1}, the variance of the means,

where

(4)    \bar{M} = \frac{\sum_{g=1}^{k} M_g}{k}, the mean of the test means.
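
The four quantities can be computed directly from the test means and the variance-covariance matrix. The following Python sketch is illustrative only: the means and covariances for k = 3 tests are hypothetical, the function name det is an assumption, and the sum in equation 2 is taken over all ordered pairs g and h with g different from h.

```python
def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


# Hypothetical data for k = 3 parallel forms.
means = [50.0, 51.0, 49.0]
cov = [[100.0, 80.0, 82.0],
       [80.0, 104.0, 78.0],
       [82.0, 78.0, 96.0]]
k = len(means)

D = det(cov)                                            # determinant
s2_bar = sum(cov[g][g] for g in range(k)) / k           # equation 1
off_diagonal = sum(cov[g][h] for g in range(k) for h in range(k) if g != h)
r_bar = off_diagonal / (k * (k - 1) * s2_bar)           # equation 2
M_bar = sum(means) / k                                  # equation 4
v = sum((m - M_bar) ** 2 for m in means) / (k - 1)      # equation 3
print(D, s2_bar, r_bar, v, M_bar)
```

Cofactor expansion is adequate for a handful of tests; for many tests an elimination method would be far more efficient.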


3. Statistical criterion for equality of means, equality of 
variances, and equality of covariances 

Following Wilks' notation, we shall use L_{mvc} to designate the statistic
appropriate for testing simultaneously the hypothesis that all means 
are equal, all variances are equal, and all covariances are equal. 

1 For the case of three parallel tests, the formulas are given without the use of 
determinantal notation. In order to deal with four or more parallel tests, it is
necessary to know how to compute the value of determinants of order 4 or higher.

(5)    L_{mvc} = \frac{D}{\bar{s}^2 [1 + (k - 1)\bar{r}] [\bar{s}^2 (1 - \bar{r}) + v]^{k-1}}.


In the simplest case of three tests, this reduces to 


(6)    L_{mvc} = \frac{s_1^2 s_2^2 s_3^2 [1 + 2 r_{12} r_{13} r_{23} - r_{12}^2 - r_{13}^2 - r_{23}^2]}{\bar{s}^2 (1 + 2\bar{r}) (\bar{s}^2 - \bar{s}^2 \bar{r} + v)^2}.


Small sample tables for evaluating this statistic are given by Wilks
(1946). In the large sample case, according to Wilks, the statistic
-N log_e L_{mvc} is approximately distributed as chi-square with
(k/2)(k + 3) - 3 degrees of freedom when the hypothesis of equal
means, equal variances, and equal covariances is true. We are
comparing an hypothesis using k means, k variances, and (k/2)(k - 1)
covariances, a total of 2k + (k/2)(k - 1) parameters, with an hypothesis
using only three parameters; hence the degrees of freedom will be

2k + (k/2)(k - 1) - 3 = (k/2)(k + 3) - 3.

For three tests, this reduces to 9 - 3 = 6 degrees of freedom. The
statistic L_{mvc} varies between zero and unity. If the means are identical
in the sample, the variances identical, and the covariances identical, L_{mvc}
equals one. As L_{mvc} approaches one, the quantity -N log L_{mvc}
approaches zero. The accompanying table gives the 5 per cent and 1 per
cent points so that, if the quantity -N log_{10} L_{mvc} calculated from a given
set of data is less than the value given in the 5 per cent column for the
appropriate number of tests (k), we may consider that the tests are
parallel. 1 If the value of -N log_{10} L_{mvc} from the data is greater than
that in the 1 per cent column for appropriate k, there is less than one
chance in a hundred that such a sample would be drawn from a
population in which means were equal, variances were equal, and covariances
were equal. Under such circumstances we should conclude that the
tests were not parallel in all respects.
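
The computation of L_mvc and its large-sample statistic can be sketched as follows. The summary values D, s2_bar, r_bar, and v, the sample size N, and the function name l_mvc are all hypothetical choices for illustration; the critical values of Table 1 are not reproduced here.

```python
import math


def l_mvc(D, s2_bar, r_bar, v, k):
    """Equation 5: criterion for equal means, variances, and covariances."""
    return D / (s2_bar * (1 + (k - 1) * r_bar)
                * (s2_bar * (1 - r_bar) + v) ** (k - 1))


# Hypothetical summary statistics for k = 3 forms given to N = 200 persons.
k, N = 3, 200
D, s2_bar, r_bar, v = 99664.0, 100.0, 0.8, 1.0

L = l_mvc(D, s2_bar, r_bar, v, k)
chi_square = -N * math.log(L)      # natural-logarithm form
tabled = -N * math.log10(L)        # common-logarithm form, as used in Table 1
df = k * (k + 3) // 2 - 3          # (k/2)(k + 3) - 3 degrees of freedom

# With identical means (v = 0) and exact compound symmetry, L equals one.
D_symmetric = s2_bar ** k * (1 - r_bar) ** (k - 1) * (1 + (k - 1) * r_bar)
L_perfect = l_mvc(D_symmetric, s2_bar, r_bar, 0.0, k)
print(L, chi_square, tabled, df, L_perfect)
```

The two logarithmic forms differ only by the constant factor log_e 10, so either can be compared with a table expressed in the matching base.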

1 Table 1 is given in terms of common logarithms (that is, to the base 10) instead
of in terms of the natural logarithms (to the base e), since extensive tables of common
logarithms are more generally available than tables of the natural logarithms.

If L_{mvc} is sufficiently near unity to support the hypothesis that the
means are identical, the variances are identical, and the covariances are
identical, the population is characterized by one common mean, one
common variance, and a common correlation (in this case reliability)
coefficient. Using the subscript zero to indicate the best estimates of
these parameters, we have

(7)    M_0 = \bar{M},

(8)    s_0^2 = \bar{s}^2 + \frac{v(k - 1)}{k},

and

(9)    r_0 = \frac{\bar{s}^2 \bar{r} - v/k}{s_0^2},

where k is the number of parallel tests and the other terms are defined 
by equations 1, 2, 3, and 4. 
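
The pooled estimates can be cross-checked numerically. The sketch below assumes the closed forms s_0^2 = s2_bar + v(k - 1)/k and r_0 = (s2_bar * r_bar - v/k) / s_0^2, and compares them with variances and covariances taken directly about the grand mean; that interpretation, like the data values, is an assumption made for this illustration.

```python
# Hypothetical data for k = 3 parallel forms.
means = [50.0, 51.0, 49.0]
cov = [[100.0, 80.0, 82.0],
       [80.0, 104.0, 78.0],
       [82.0, 78.0, 96.0]]
k = len(means)
pairs = [(g, h) for g in range(k) for h in range(k) if g != h]

M_bar = sum(means) / k
s2_bar = sum(cov[g][g] for g in range(k)) / k
r_bar = sum(cov[g][h] for g, h in pairs) / (k * (k - 1) * s2_bar)
v = sum((m - M_bar) ** 2 for m in means) / (k - 1)

# Closed forms for the pooled estimates.
M0 = M_bar
s0_sq = s2_bar + v * (k - 1) / k
r0 = (s2_bar * r_bar - v / k) / s0_sq

# Cross-check: average variance and covariance about the grand mean.
var_direct = sum(cov[g][g] + (means[g] - M_bar) ** 2 for g in range(k)) / k
cov_direct = sum(cov[g][h] + (means[g] - M_bar) * (means[h] - M_bar)
                 for g, h in pairs) / (k * (k - 1))
print(M0, s0_sq, r0, var_direct, cov_direct / var_direct)
```

When the sample means are identical, v is zero and the pooled estimates reduce to the average variance and average correlation.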

It should be noted that the use commonly made of chi-square and other 
significance tests is to test an hypothesis that the experimenter hopes 
is incorrect. The term “significant difference” means that the data
diverge significantly from what would be expected in view of the
hypothesis being tested. In other words, the experimenter tests the
hypothesis that “A = B” while arranging the experimental conditions
to the best of his ability so that A ≠ B. The use of the criterion for
parallel tests is an instance of testing an hypothesis that the investigator 
hopes will be verified. Since considerable effort has been expended to 
select items and establish norms so that the tests will be parallel, we 
hope that the means, variances, and covariances will be about equal. 
Therefore what we hope to find in this test is what would commonly 
be called an “insignificant difference.” 

Whenever the ideational structure of any scientific field has developed 
sufficiently, investigators will be testing hypotheses that they believe 
are true; hence they will be hoping to find insignificant differences
between the data they get and those to be expected from the hypothesis.
The current search by psychologists for significant differences is merely a
concomitant of the fact that they have no precise hypotheses that can
be tested; hence typically the investigator does his best to shape
conditions so that groups A and B will be different, and yet tests the hypothesis
that “A = B” hoping to find that it is not adequate for the data. Only
rarely do we find the next step, of devising an hypothesis that
presumably fits the data, testing this hypothesis, and finding a
“nonsignificant difference,” which indicates that an acceptable hypothesis has
been found. 


4. Criterion for equality of variances and equality of covariances

If, on testing for the hypothesis that means, variances, and covariances
are equal, we find a significant difference (a small value of L_{mvc} or a
relatively large value of -N log_e L_{mvc}), there would then be some interest
in determining whether or not that difference is attributable solely to
differences in means. If it can be shown that the variances are equal
and that the covariances are equal, while the means are different, it is
easy to adjust test scores (by adding or subtracting a suitable constant)
so that the means of the adjusted scores will be equal. It should be
noted that a test for equality of means cannot be made on data from
the group of subjects used to compute the adjustment. However, if
the norms established from one group are used to adjust test scores for a
second group, the test for equality of means of these converted scores
could be made on the second group. In order to determine whether the
difficulty is with the means alone or with variances and covariances, or
both, two other statistics are of interest, one for testing equality of
variances and equality of covariances and another for testing equality of
means. The statistic (L_{vc}) for testing equality of variances and equality
of covariances is like (L_{mvc}) given in section 3 except that the term v is
omitted from the denominator:

(10)    L_{vc} = \frac{D}{\bar{s}^2 [1 + (k - 1)\bar{r}] [\bar{s}^2 (1 - \bar{r})]^{k-1}}.


When there are three tests, we have 


(11)    L_{vc} = \frac{s_1^2 s_2^2 s_3^2 [1 + 2 r_{12} r_{13} r_{23} - r_{12}^2 - r_{13}^2 - r_{23}^2]}{\bar{s}^2 (1 + 2\bar{r}) (\bar{s}^2 - \bar{s}^2 \bar{r})^2}.


The quantity -N log_e L_{vc} is approximately distributed for large
samples according to the chi-square law with (k/2)(k + 1) - 2 degrees
of freedom, when the hypothesis is true. For three tests, -N log_e L_{vc}
is approximately distributed for large samples according to the
chi-square law with four degrees of freedom. L_{vc} varies between zero and
one. As the variances become more alike and the covariances become
more alike, the value of L_{vc} approaches unity, or the value of -N log L_{vc}
approaches zero. If the value of -N log_{10} L_{vc} is smaller than that
indicated in the 5 per cent column for appropriate k, we may conclude
that the tests are parallel except possibly for differences in means. If
the value of -N log_{10} L_{vc} is larger than that indicated in the 1 per cent
column for appropriate k, the tests are not parallel as far as variances
and covariances are concerned.
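
A matching sketch for L_vc follows. As before, the summary values, the sample size N, and the function name l_vc are hypothetical, and the resulting statistic would still have to be compared with the tabled 5 per cent and 1 per cent points, which are not reproduced here.

```python
import math


def l_vc(D, s2_bar, r_bar, k):
    """Equation 10: criterion for equal variances and equal covariances."""
    return D / (s2_bar * (1 + (k - 1) * r_bar)
                * (s2_bar * (1 - r_bar)) ** (k - 1))


# Hypothetical summary statistics for k = 3 forms given to N = 200 persons.
k, N = 3, 200
D, s2_bar, r_bar = 99664.0, 100.0, 0.8

L = l_vc(D, s2_bar, r_bar, k)
chi_square = -N * math.log(L)      # natural-logarithm form
tabled = -N * math.log10(L)        # common-logarithm form
df = k * (k + 1) // 2 - 2          # (k/2)(k + 1) - 2 degrees of freedom
print(L, chi_square, tabled, df)
```

Because v does not enter the denominator, this statistic is unaffected by differences among the test means.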

If the test with L_{mvc} indicated that the tests could not be regarded as
parallel, whereas the test with L_{vc} showed that the tests were parallel
as far as variances and covariances were concerned, the population
represented by the data is characterized by k + 2 parameters. These
are the k means, one variance, and one reliability coefficient. The best
estimates of these parameters are, respectively, the test means (M_g),
the average variance, \bar{s}^2 (see equation 1), and the average correlation \bar{r},
as given in equation 2.

If, after finding a “significant” L_{mvc}, we also find a significant L_{vc},
this shows that equating the means still would not make the tests
parallel. Either the variances, the covariances, or both are significantly
different. It would be desirable next to test covariances and variances
separately for significance, for, if the covariances are not significantly
different, both means and variances can be brought into line by an
appropriate linear transformation. However, if the covariances are
significantly different, it is not possible to set up norms that will “equate”
the tests. Unfortunately, since it is not yet possible to test the
significance of the difference of covariances independently of similarities or
differences in variances, for the present this step must be omitted. If
we have two samples to which the tests have been given, it is possible
to make the equivalent of a test of covariances independently of
variances by the following procedure. Compute the standard deviations of
both forms (s_x and s_y) for the first sample. If the y-scores are multiplied
by s_x/s_y, the standard deviation of the y-scores will be identical with
that of the x-scores for that same sample. Now regardless of the
standard deviations of x and y in the second sample, multiply all the
y-scores by the multiplier (s_x/s_y) determined from the first sample. The
test with L_{vc} may then be made, using the x-scores and the transformed
y-scores both from the second sample. If the test L_{vc} indicates that the
forms are parallel, we may conclude that, if the multiplier s_x/s_y is used
for the y-scores, the forms are parallel. If the test with L_{vc} shows that
the forms are not parallel, the difficulty probably lies with the
covariances, since standard deviations were equated.

Suppose, on the other hand, that, after finding a significant L_mvc, we
find homogeneity when testing with L_vc. It is then reasonable to suppose
that the difference in means was responsible for the heterogeneity shown
by the test L_mvc. If subsequently the test by L_m shows that the means
are heterogeneous, we have a consistent set of results and can conclude
that the variances and covariances are equal, and that the heterogeneity
of the means is the reason for failure to show homogeneity when testing
with L_m or L_mvc. If, after failing to find a sufficiently large value of
L_mvc, we find large values for both L_vc and L_m, the results are inconsistent.
In this sort of instance (which is not impossible) we are dealing
with some peculiar borderline case in which the means alone or the variances
and covariances alone show homogeneity. With the increased
degrees of freedom for testing the more comprehensive hypothesis,
however, we find a significant difference. In such a case the only conclusion
we can reach is that the tests are not parallel tests, but that the
difficulty is not clearly indicated as due to mean differences or to
variance-covariance differences.


5. Statistical criterion for equality of means 

For testing equality of means (if L_vc is nonsignificant), we use

\[ L_m = \frac{s^2(1 - r)}{s^2(1 - r) + v}. \tag{12} \]

The quantity -N(k - 1) log_e L_m is distributed approximately as
chi-square with k - 1 degrees of freedom for large samples when the
hypothesis is true. The value of L_m is unity if the sample means are
identical, and approaches zero as the sample means diverge. The
quantity -N(k - 1) log L_m is zero if the sample means are identical,
and increases as the sample means diverge. The 5 per cent and 1 per
cent points for -N(k - 1) log_10 L_m are given in the last two columns
of Table 1.
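
As a rough sketch of how equation 12 is applied, the following computes L_m and -N(k - 1) log_10 L_m. The function name is an assumption made for this illustration; the input values s^2(1 - r) = 8.6467, v = .07, N = 130, and k = 3 are those of the illustrative problem in section 6.

```python
import math

def means_criterion(s2_one_minus_r, v, n, k):
    """Return L_m (equation 12) and the statistic -N(k - 1) log10 L_m."""
    l_m = s2_one_minus_r / (s2_one_minus_r + v)
    statistic = -n * (k - 1) * math.log10(l_m)
    return l_m, statistic

l_m, statistic = means_criterion(8.6467, 0.07, n=130, k=3)
print(round(l_m, 4), round(statistic, 2))

# 5 per cent point for k = 3 (2 d.f.) from the last columns of Table 1.
FIVE_PER_CENT_POINT = 2.60206
means_differ = statistic > FIVE_PER_CENT_POINT
```

A statistic below the tabled point, as here, gives no ground for rejecting equality of the means.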

Wilks (1946) has shown that

\[ L_{mvc} = L_{vc} \cdot L_m^{\,k-1}. \tag{13} \]

This relationship may be used as a partial arithmetical check when all
three values are computed.

We can also conclude from equation 13 and from Table 1 that, if
L_vc and L_m are each small enough to give -N log_10 L values above the
5 per cent or the 1 per cent points, the value of L_mvc will be small enough
to give a value of -N log_10 L_mvc that will be above the 5 per cent or
1 per cent point, as the case may be. It will be noted that the 5 per
cent and 1 per cent points for -N log_10 L_mvc are in each case less than
the sum of the two corresponding values for -N log_10 L_vc and
-N(k - 1) log_10 L_m. That is, if the means tested by themselves are
significantly different and if also the variance-covariance matrix tested
by itself shows a significant difference at the 1 per cent or 5 per cent
level, the test with L_mvc must show a significant difference unless errors
were made in the computations.

If L_mvc shows that the tests may be regarded as parallel, whereas one
and only one of the other tests (either L_vc or L_m) indicates a significant
difference, again we are dealing with a perfectly possible but borderline
case, and must conclude that the tests are not parallel either with respect
to means or with respect to variances and covariances.

TABLE 1

Approximate 5 per Cent and 1 per Cent Points for -N log_10 L_mvc, -N log_10 L_vc,
and -N(k - 1) log_10 L_m for k = 2, 3, 4, 5, 6

          -N log_10 L_mvc              -N log_10 L_vc            -N(k - 1) log_10 L_m
 k    d.f.   5 per     1 per      d.f.   5 per     1 per      d.f.   5 per     1 per
             cent      cent              cent      cent              cent      cent
 2      2    2.60206   4.00000      1    1.66832   2.88150      1    1.66832   2.88150
 3      6    5.4685    7.3013       4    4.12047   5.7660       2    2.60206   4.00000
 4     11    8.5448   10.7379       8    6.7347    8.7251       3    3.39389   4.9270
 5     17  (illegible) 14.5092     13    9.7117   12.0249       4    4.12047   5.7660
 6     24   15.8149   18.6659      19   13.0912   15.7175       5    4.8079    6.5519

Adapted from Wilks (1946), page 266.

If N ≥ 100, this table is sufficiently accurate. If N < 100, see Wilks (1946)
for a detailed statement of the accuracy of this table, and for small sample methods.

Note that the entries in this table are given in terms of logarithms to the base 10.
Hence these entries are 0.43429 times the entries in Wilks (1946), page 266.
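
The factor 0.43429 is log_10 e, which converts a statistic expressed in natural logarithms into one expressed in logarithms to the base 10. The sketch below checks this for the entries with 2 degrees of freedom, where the upper-tail point of chi-square is exactly -2 log_e α. The function names are chosen for this illustration only.

```python
import math

LOG10_E = math.log10(math.e)  # approximately 0.43429

def to_log10_units(chi_square_point):
    """Convert a chi-square point (natural-log scale) to the base-10 scale."""
    return LOG10_E * chi_square_point

def chi_square_point_2df(alpha):
    """Exact upper-tail point of chi-square with 2 d.f.: -2 ln(alpha)."""
    return -2.0 * math.log(alpha)

five_per_cent = to_log10_units(chi_square_point_2df(0.05))
one_per_cent = to_log10_units(chi_square_point_2df(0.01))
print(round(five_per_cent, 5), round(one_per_cent, 5))
```

The two results agree with the 2 d.f. entries of the table (2.60206 and 4.00000).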

6. Illustrative problem for hypotheses H_mvc, H_vc, and H_m

The computation of the three criteria is illustrated in the following
example. Three parallel tests (1, 2, and 3) are given to 130 subjects.

Test    Mean    Standard Deviation    Correlation
 1      27.8          9.9             r_12 = .93
 2      28.3         10.1             r_13 = .92
 3      27.9         10.4             r_23 = .90

          M_g     (M_g - M)^2     s_g      s_g^2     s_g s_h r_gh
         27.8        .04          9.9      98.01      92.9907
         28.3        .09         10.1     102.01      94.7232
         27.9        .01         10.4     108.16      94.5360
Sum      84.0        .14                  308.18     282.2499
     M = 28.0       (.07)* = v         s^2 = 102.73   94.0833 = s^2 r

* Divided by 2. The other sums are divided by 3.

D = s_1^2 s_2^2 s_3^2 (1 + 2 r_12 r_13 r_23 - r_12^2 - r_13^2 - r_23^2) = 20,308.3859.

s^2(1 - r) = 8.6467     [s^2(1 - r)]^2 = 74.7654     s^2(1 + 2r) = 290.8967.

s^2(1 - r) + v = 8.7167     [s^2(1 - r) + v]^2 = 75.9809.

\[ L_{mvc} = \frac{D}{s^2(1 + 2r)\,[s^2(1 - r) + v]^2} = .9188, \]

-N log_10 L_mvc = 4.78.

\[ L_{vc} = \frac{D}{s^2(1 + 2r)\,[s^2(1 - r)]^2} = .9338, \]

-N log_10 L_vc = 3.87.

\[ L_m = \frac{s^2(1 - r)}{s^2(1 - r) + v} = .9920, \]

-N(k - 1) log_10 L_m = 0.91.

By reference to Table 1 it is clear that all three criteria show the three 
tests to be parallel. The data are in agreement with the hypothesis 
that the means are equal, the variances are equal, and the covariances 
are equal. 
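
The arithmetic of this example can be retraced with a short script. This is a sketch under the assumption that the three criteria take the forms used in the worked computation above for k = 3; the variable names are illustrative. Because the script carries the average variance at full precision (102.7267 rather than the rounded 102.73), its results differ from the printed ones in the second decimal place.

```python
import math

means = [27.8, 28.3, 27.9]
sds = [9.9, 10.1, 10.4]
r12, r13, r23 = 0.93, 0.92, 0.90
n, k = 130, 3

grand_mean = sum(means) / k
v = sum((m - grand_mean) ** 2 for m in means) / (k - 1)   # variance of the means
s2 = sum(s * s for s in sds) / k                           # average variance
s2r = (sds[0] * sds[1] * r12 + sds[0] * sds[2] * r13
       + sds[1] * sds[2] * r23) / 3                        # average covariance

# Determinant of the 3 x 3 variance-covariance matrix.
D = math.prod(s * s for s in sds) * (
    1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2)

a = s2 - s2r          # s^2(1 - r)
b = s2 + 2 * s2r      # s^2(1 + 2r)

L_mvc = D / (b * (a + v) ** 2)
L_vc = D / (b * a ** 2)
L_m = a / (a + v)

stat_mvc = -n * math.log10(L_mvc)
stat_vc = -n * math.log10(L_vc)
stat_m = -n * (k - 1) * math.log10(L_m)
print(round(stat_mvc, 2), round(stat_vc, 2), round(stat_m, 2))
```

All three statistics fall below the 5 per cent points of Table 1 for k = 3 (5.4685, 4.12047, and 2.60206), and the product check of equation 13 holds exactly for the computed values.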

7. Hypotheses of compound symmetry 

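The criterion devised by Wilks (1946) applies only to means, variances,
and covariances of parallel tests. In addition, parallel tests
should have equal validities for predicting any criterion. The statistical
criteria for “compound symmetry” presented by Votaw (1948) include a
statistical test for equal validities of a set of parallel tests. We shall
present here only a restricted case of one form of compound symmetry,
the case where we are interested in two sets of parallel tests (x_g and z_g)
and the criteria y_g. Let us say that there are k parallel tests in the
x-set, f parallel tests in the z-set, and b different criteria to be predicted.
This set of b + k + f tests given to N persons results in a variance-covariance
matrix and its determinant (D) given in equation 14.

\[
D = \left| \begin{array}{cccccccccccc}
s_{y_1}^2 & c_{y_1y_2} & \cdots & c_{y_1y_b} & c_{y_1x_1} & c_{y_1x_2} & \cdots & c_{y_1x_k} & c_{y_1z_1} & c_{y_1z_2} & \cdots & c_{y_1z_f} \\
c_{y_2y_1} & s_{y_2}^2 & \cdots & c_{y_2y_b} & c_{y_2x_1} & c_{y_2x_2} & \cdots & c_{y_2x_k} & c_{y_2z_1} & c_{y_2z_2} & \cdots & c_{y_2z_f} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
c_{y_by_1} & c_{y_by_2} & \cdots & s_{y_b}^2 & c_{y_bx_1} & c_{y_bx_2} & \cdots & c_{y_bx_k} & c_{y_bz_1} & c_{y_bz_2} & \cdots & c_{y_bz_f} \\
c_{x_1y_1} & c_{x_1y_2} & \cdots & c_{x_1y_b} & s_{x_1}^2 & c_{x_1x_2} & \cdots & c_{x_1x_k} & c_{x_1z_1} & c_{x_1z_2} & \cdots & c_{x_1z_f} \\
c_{x_2y_1} & c_{x_2y_2} & \cdots & c_{x_2y_b} & c_{x_2x_1} & s_{x_2}^2 & \cdots & c_{x_2x_k} & c_{x_2z_1} & c_{x_2z_2} & \cdots & c_{x_2z_f} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
c_{x_ky_1} & c_{x_ky_2} & \cdots & c_{x_ky_b} & c_{x_kx_1} & c_{x_kx_2} & \cdots & s_{x_k}^2 & c_{x_kz_1} & c_{x_kz_2} & \cdots & c_{x_kz_f} \\
c_{z_1y_1} & c_{z_1y_2} & \cdots & c_{z_1y_b} & c_{z_1x_1} & c_{z_1x_2} & \cdots & c_{z_1x_k} & s_{z_1}^2 & c_{z_1z_2} & \cdots & c_{z_1z_f} \\
c_{z_2y_1} & c_{z_2y_2} & \cdots & c_{z_2y_b} & c_{z_2x_1} & c_{z_2x_2} & \cdots & c_{z_2x_k} & c_{z_2z_1} & s_{z_2}^2 & \cdots & c_{z_2z_f} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
c_{z_fy_1} & c_{z_fy_2} & \cdots & c_{z_fy_b} & c_{z_fx_1} & c_{z_fx_2} & \cdots & c_{z_fx_k} & c_{z_fz_1} & c_{z_fz_2} & \cdots & s_{z_f}^2
\end{array} \right| \tag{14}
\]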

where y_g, y_h (g, h = 1, 2, ..., b) designates the criterion variables,
x_g, x_h (g, h = 1, 2, ..., k) designates the parallel tests of the x-set,
z_g, z_h (g, h = 1, 2, ..., f) designates the parallel tests of the z-set,
s designates a standard deviation,
c designates a covariance term,
b designates the number of criterion variables,
k designates the number of parallel tests in the x-set,
f designates the number of parallel tests in the z-set.

We shall consider three hypotheses (H̄_mvc, H̄_vc, and H̄_m) regarding the
relationships among the set of tests designated in equation 14.

Let H̄_mvc designate the hypothesis that, for each set of parallel tests,
the population means are equal, the population variances are equal, the
population covariances are equal, and the population covariances with
any single criterion variable are equal; between any two sets of parallel
tests, the population covariances are equal. For the case of two sets of
parallel tests and b different criterion variables that is indicated in
equation 14, let μ_{y_g}, μ_{x_g}, and μ_{z_g} designate population means,
σ²_{y_g}, σ²_{x_g}, and σ²_{z_g} designate population variances, and let ζ with
appropriate subscripts designate a population covariance. In terms of this
notation, hypothesis H̄_mvc asserts that:

1. All μ_{x_g} (g = 1, 2, ..., k) are equal.

2. All μ_{z_g} (g = 1, 2, ..., f) are equal.

3. All σ²_{x_g} (g = 1, 2, ..., k) are equal.

4. All σ²_{z_g} (g = 1, 2, ..., f) are equal.

5. All ζ_{x_g x_h} (g ≠ h = 1, 2, ..., k) are equal.

6. All ζ_{z_g z_h} (g ≠ h = 1, 2, ..., f) are equal.

7. All ζ_{x_g z_h} (g = 1, 2, ..., k; h = 1, 2, ..., f) are equal.

8. For any fixed value of h (h = 1, 2, ..., b), all ζ_{x_g y_h} (g = 1, 2, ..., k)
are equal.

9. For any fixed value of h (h = 1, 2, ..., b), all ζ_{z_g y_h} (g = 1, 2, ..., f)
are equal.

Let H̄_vc designate the hypothesis that, for each set of parallel tests,
the population variances are equal, the population covariances are
equal, and the population covariances with any single criterion variable
are equal; between any two sets of parallel tests, the population covariances
are equal. This hypothesis is identical with H̄_mvc, except that no
restrictions are imposed on the means.

Let H̄_m designate the hypothesis that, for each set of parallel tests,
the population means are equal, given that H̄_vc is true.


8. Basic statistics needed for tests of compound symmetry 

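In addition to the determinant of equation 14, and the variances and
covariances indicated in its elements, the following quantities are needed
for the tests of compound symmetry represented by hypotheses
H̄_mvc, H̄_vc, and H̄_m.

The mean for each of the k + f predictor tests and the grand mean for
each of the two sets are needed, as follows:

\[ \bar{x}_{g.} = \frac{\sum_{i=1}^{N} x_{gi}}{N}, \tag{15} \]

\[ \bar{x}_{..} = \frac{\sum_{g=1}^{k} \sum_{i=1}^{N} x_{gi}}{kN}, \tag{16} \]

\[ \bar{z}_{g.} = \frac{\sum_{i=1}^{N} z_{gi}}{N}, \tag{17} \]

\[ \bar{z}_{..} = \frac{\sum_{g=1}^{f} \sum_{i=1}^{N} z_{gi}}{fN}. \tag{18} \]

The variance of each set of means is also needed:

\[ v_x = \frac{\sum_{g=1}^{k} (\bar{x}_{g.} - \bar{x}_{..})^2}{k - 1}, \tag{19} \]

\[ v_z = \frac{\sum_{g=1}^{f} (\bar{z}_{g.} - \bar{z}_{..})^2}{f - 1}. \tag{20} \]

The averages for certain sets of variances and covariances are needed,
as follows:

\[ c_{y_g x} = \frac{\sum_{h=1}^{k} c_{y_g x_h}}{k}, \tag{21} \]

\[ c_{y_g z} = \frac{\sum_{h=1}^{f} c_{y_g z_h}}{f}, \tag{22} \]

\[ u_x = \frac{\sum_{g=1}^{k} s_{x_g}^2}{k}, \tag{23} \]

\[ u_z = \frac{\sum_{g=1}^{f} s_{z_g}^2}{f}, \tag{24} \]

\[ w_x = \frac{\sum_{g \neq h} c_{x_g x_h}}{k(k - 1)}, \tag{25} \]

\[ w_z = \frac{\sum_{g \neq h} c_{z_g z_h}}{f(f - 1)}, \tag{26} \]

\[ c_{xz} = \frac{\sum_{g=1}^{k} \sum_{h=1}^{f} c_{x_g z_h}}{kf}. \tag{27} \]

Using the criterion variances and covariances from equation 14, and
the averages defined in equations 21 to 27, we define the determinant B,
of order b + 2, as follows:

\[
B = \left| \begin{array}{cccccc}
s_{y_1}^2 & c_{y_1y_2} & \cdots & c_{y_1y_b} & c_{y_1x}\sqrt{k} & c_{y_1z}\sqrt{f} \\
c_{y_2y_1} & s_{y_2}^2 & \cdots & c_{y_2y_b} & c_{y_2x}\sqrt{k} & c_{y_2z}\sqrt{f} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
c_{y_by_1} & c_{y_by_2} & \cdots & s_{y_b}^2 & c_{y_bx}\sqrt{k} & c_{y_bz}\sqrt{f} \\
c_{y_1x}\sqrt{k} & c_{y_2x}\sqrt{k} & \cdots & c_{y_bx}\sqrt{k} & u_x + (k - 1)w_x & c_{xz}\sqrt{kf} \\
c_{y_1z}\sqrt{f} & c_{y_2z}\sqrt{f} & \cdots & c_{y_bz}\sqrt{f} & c_{xz}\sqrt{kf} & u_z + (f - 1)w_z
\end{array} \right| \tag{28}
\]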
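
The averages of equations 19, 21, 23, and 25, and the determinant of equation 28, can be sketched for the simplest case of a single criterion (b = 1) and a single set of parallel tests, where B reduces to a determinant of order 2. The summary data are those of the illustrative problem in section 12; the variable names are assumptions made for this sketch. Since nothing is rounded here, B comes out slightly different from the value 72,436.5 obtained in that problem from rounded intermediate figures.

```python
# One criterion y and k = 3 tests (data of the section 12 example).
sd_y = 21.0
sd_x = [10.0, 9.0, 12.0]
mean_x = [118.0, 117.0, 119.0]
r_yx = [0.64, 0.66, 0.65]
r_xx = {(0, 1): 0.88, (0, 2): 0.92, (1, 2): 0.90}
k = len(sd_x)

grand_mean = sum(mean_x) / k                                    # equation 16
v_x = sum((m - grand_mean) ** 2 for m in mean_x) / (k - 1)      # equation 19
c_yx = sum(r * sd_y * s for r, s in zip(r_yx, sd_x)) / k        # equation 21
u_x = sum(s * s for s in sd_x) / k                              # equation 23
# Equation 25: every unordered pair occurs twice among the k(k - 1) ordered pairs.
w_x = 2 * sum(r * sd_x[g] * sd_x[h]
              for (g, h), r in r_xx.items()) / (k * (k - 1))

# Equation 28 for b = 1 and one set: | s_y^2, c_yx*sqrt(k); c_yx*sqrt(k), u_x + (k-1)w_x |
B = sd_y ** 2 * (u_x + (k - 1) * w_x) - k * c_yx ** 2
print(round(v_x, 2), round(c_yx, 2), round(u_x, 2), round(w_x, 2), round(B, 1))
```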


9. The criterion for hypothesis H̄_mvc

The sample criterion for H̄_mvc is given by

\[ \bar{L}_{mvc} = \frac{D}{B\,[u_x - w_x + v_x]^{k-1}\,[u_z - w_z + v_z]^{f-1}}. \tag{29} \]

If N is large and H̄_mvc is true, the quantity -N log_e L̄_mvc is distributed
approximately as chi-square with

\[ \frac{k + f}{2}(k + f + 3) + b(k + f - 2) - 7 \]

degrees of freedom. If only one set of parallel tests is available, for the
test with L̄_mvc we have (k/2)(k + 3) + b(k - 1) - 3 degrees of freedom.
The general formulation for any number of sets of parallel tests with N
large is given by Votaw (1948), page 467.

For the special case of two parallel tests designated by subscripts
1 and 2 and a single criterion variable (y), we have

\[ \bar{L}_{mvc} = \frac{s_y^2 s_1^2 s_2^2 \,(1 + 2 r_{y1} r_{y2} r_{12} - r_{y1}^2 - r_{y2}^2 - r_{12}^2)}{[s_y^2 (u + w) - 2 c_{yx}^2]\,[u - w + v]} \tag{30} \]

where u = (s_1^2 + s_2^2)/2,
w = r_12 s_1 s_2,
v = (X̄_1 - X̄_2)^2 / 2,
c_yx = (c_{yx_1} + c_{yx_2})/2.

If N is large and H̄_mvc is true, the quantity -N log_e L̄_mvc is distributed
approximately as chi-square with 5 + 1 - 3 = 3 degrees of freedom.

When L̄_mvc is sufficiently near unity to support the hypothesis H̄_mvc,
the best estimates (shown below by 0 subscripts) of the common parameters
indicated in conditions 1 to 9 for H̄_mvc are as follows.

\[ \bar{x}_0 = \bar{x}_{..} \quad \text{(see equation 16)}, \tag{31} \]

\[ \bar{z}_0 = \bar{z}_{..} \quad \text{(see equation 18)}, \tag{32} \]

\[ s_{x_0}^2 = u_x + \frac{v_x (k - 1)}{k} \quad \text{(see equations 19 and 23)}, \tag{33} \]

\[ s_{z_0}^2 = u_z + \frac{v_z (f - 1)}{f} \quad \text{(see equations 20 and 24)}, \tag{34} \]

\[ r_{x_0 x_0} = \frac{w_x - v_x / k}{s_{x_0}^2} \quad \text{(see equations 19 and 25)}, \tag{35} \]

\[ r_{z_0 z_0} = \frac{w_z - v_z / f}{s_{z_0}^2} \quad \text{(see equations 20 and 26)}, \tag{36} \]

\[ r_{x_0 z_0} = \frac{c_{xz}}{s_{x_0} s_{z_0}} \quad \text{(see equation 27)}, \tag{37} \]

\[ r_{x_0 y_g} = \frac{c_{y_g x}}{s_{y_g} s_{x_0}} \quad \text{(see equation 21)}, \tag{38} \]

\[ r_{z_0 y_g} = \frac{c_{y_g z}}{s_{y_g} s_{z_0}} \quad \text{(see equation 22)}. \tag{39} \]


10. The criterion for hypothesis H̄_vc

If the value of L̄_mvc is small (that is, the value of -N log_e L̄_mvc is
large), H̄_mvc cannot be accepted. In this case we may wish to see if
the differences in means of the parallel tests account for the failure to
satisfy H̄_mvc. In order to do this we next investigate hypothesis H̄_vc.
For this test the sample criterion is taken as

\[ \bar{L}_{vc} = \frac{D}{B\,[u_x - w_x]^{k-1}\,[u_z - w_z]^{f-1}}. \tag{40} \]

If N is large and H̄_vc is true, the quantity -N log_e L̄_vc is distributed
approximately as chi-square with

\[ \frac{k + f}{2}(k + f + 1) + b(k + f - 2) - 5 \]

degrees of freedom. If only one set of parallel tests is available, for the
test with L̄_vc we have (k/2)(k + 1) + b(k - 1) - 2 degrees of freedom.
The general formulation for any number of tests with N large is given
by Votaw (1948), page 467.

For the special case of one criterion variable (y) and two parallel tests
(designated by subscripts 1 and 2), we have

\[ \bar{L}_{vc} = \frac{s_y^2 s_1^2 s_2^2 \,[1 + 2 r_{y1} r_{y2} r_{12} - r_{y1}^2 - r_{y2}^2 - r_{12}^2]}{[s_y^2 (u_x + w_x) - 2 c_{yx}^2]\,[u_x - w_x]} \tag{41} \]

where u_x = (s_{x_1}^2 + s_{x_2}^2)/2 (see equation 23),
w_x = c_{x_1 x_2} (see equation 25), and
c_yx = (c_{yx_1} + c_{yx_2})/2 (see equation 21).

In this case, when N is large and H̄_vc is true, the quantity -N log_e L̄_vc
is distributed approximately as chi-square with 3 + 1 - 2 = 2 degrees
of freedom.

If the test with L̄_mvc indicated that the tests could not be regarded as
parallel, whereas the test with L̄_vc indicated that the tests could be
regarded as parallel as far as variances and covariances were concerned,
the population represented by the data is characterized by:

1. A mean for each test, giving k + f + b means, represented by
X̄_g, Z̄_g, and Ȳ_g.

2. A variance for each criterion variable (s_{y_g}^2) and the two variances
u_x and u_z, given by equations 23 and 24.

3. Two reliability coefficients given by w_x/u_x and w_z/u_z (see equations
23 to 26).

4. The intercorrelation r_xz given by c_xz/√(u_x u_z) (see equations 23,
24, and 27).

5. Two validity coefficients for each y_g given by c_{y_g x}/(s_{y_g} √u_x) and
c_{y_g z}/(s_{y_g} √u_z) (see equations 21 and 22).


11. The criterion for hypothesis H̄_m

If the test with L̄_mvc has shown “significant” differences, whereas the
test with L̄_vc substantiates hypothesis H̄_vc, the presumption would be
that the tests might be regarded as parallel except for the values of the
means. If H̄_vc is true, it is possible to test the means directly. The
criterion for hypothesis H̄_m (assuming H̄_vc) is

\[ \bar{L}_m = \frac{(u_x - w_x)^{k-1}\,(u_z - w_z)^{f-1}}{(u_x - w_x + v_x)^{k-1}\,(u_z - w_z + v_z)^{f-1}}. \tag{42} \]

If N is large and H̄_m is true, the quantity -N log_e L̄_m is distributed
approximately as chi-square with k + f - 2 degrees of freedom.
If only one set of parallel tests is used, for the test with L̄_m we have
k - 1 degrees of freedom. The general formulation for any number
of sets of parallel tests with N large is given by Votaw (1948), page 467.
As an arithmetical check it should be noted that

\[ \bar{L}_m \cdot \bar{L}_{vc} = \bar{L}_{mvc}. \tag{43} \]

12. Illustrative problem for compound symmetry 

The computation of the three criteria for compound symmetry is
illustrated in the following example. Information is available on three
parallel tests (1, 2, and 3) and a criterion (y) for 100 subjects. The
correlations, means, and standard deviations are:

                         y        1        2        3
y                      1.00      .64      .66      .65
1                       .64     1.00      .88      .92
2                       .66      .88     1.00      .90
3                       .65      .92      .90     1.00
Standard deviation      21.      10.       9.      12.
Mean                   101.     118.     117.     119.

The determinant of the correlation matrix is .01442064. Multiplying
this by the product of the four variances gives the determinant of
the variance-covariance matrix (D),

D = 144 × 441 × 81 × 100 × .01442064 = 7,417,723.

The determinant B = (441)(299.5) - (141√3)^2 = 72,436.5,

(u_x - w_x + v_x)^{k-1} = (108.3 - 95.6 + 1)^2 = 187.69,

(u_x - w_x)^{k-1} = (108.3 - 95.6)^2 = 161.29.

From the foregoing results, we have

L̄_mvc = 7,417,723 / [(72,436.5)(187.69)] = .5456,

L̄_vc = 7,417,723 / [(72,436.5)(161.29)] = .6349,

L̄_m = 161.29 / 187.69 = .8593.

TABLE 2

Approximate 5 per Cent and 1 per Cent Points for -N log_e L, and Also
-N log_10 L for Degrees of Freedom 1 to 30

             -N log_e L                -N log_10 L
d.f.    5 per cent   1 per cent    5 per cent   1 per cent
  1        3.84         6.64          1.67         2.88
  2        5.99         9.21          2.60         4.00
  3        7.82        11.34          3.39         4.93
  4        9.49        13.28          4.12         5.77
  5       11.07        15.09          4.81         6.55
  6       12.59        16.81          5.47         7.30
  7       14.07        18.48          6.11         8.02
  8       15.51        20.09          6.73         8.72
  9       16.92        21.67          7.35         9.41
 10       18.31        23.21          7.95        10.08
 11       19.68        24.72          8.54        10.74
 12       21.03        26.22          9.13        11.39
 13       22.36        27.69          9.71        12.03
 14       23.68        29.14         10.28        12.66
 15       25.00        30.58         10.86        13.28
 16       26.30        32.00         11.42        13.90
 17       27.59        33.41         11.98        14.51
 18       28.87        34.80         12.54        15.11
 19       30.14        36.19         13.09        15.72
 20       31.41        37.57         13.64        16.32
 21       32.67        38.93         14.19        16.91
 22       33.92        40.29         14.73        17.50
 23       35.17        41.64         15.27        18.08
 24       36.42        42.98         15.82        18.67
 25       37.65        44.31         16.35        19.24
 26       38.88        45.64         16.89        19.82
 27       40.11        46.96         17.42        20.39
 28       41.34        48.28         17.95        20.97
 29       42.56        49.59         18.48        21.54
 30       43.77        50.89         19.01        22.10

For d.f. larger than 30,

z = √(2χ²) - √(2(d.f.) - 1)

is distributed approximately as the unit normal curve. For z the 5 per cent point is
1.645, and the 1 per cent point is 2.326.

The values in columns 2 and 3 are chi-square values; those in columns 4 and 5 are
0.43429 times the corresponding chi-square value.

Table 2 shows the 5 per cent and 1 per cent points for -N log_e L,
which is chi-square, and also the corresponding 5 per cent and 1 per cent
points for -N log_10 L, which is 0.43429 times the corresponding
chi-square value. These values in terms of logarithms to the base 10 are
given because such tables are usually more readily available; hence some
workers may prefer to use these values. For this illustrative problem,
we have

-N log_e L̄_mvc = 60.7     -N log_10 L̄_mvc = 26.3     d.f. = 8,

-N log_e L̄_vc = 45.4      -N log_10 L̄_vc = 19.7      d.f. = 6,

-N log_e L̄_m = 15.2       -N log_10 L̄_m = 6.59       d.f. = 2.

By reference to Table 2 we see that these values are considerably 
larger than the 1 per cent point values for the corresponding degrees of 
freedom. The values are clearly significant at the 1 per cent level, that 
is, these three tests cannot be regarded as parallel tests for predicting 
criterion y. 
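
The computation for this problem can be retraced as follows. This is a sketch, not a general routine: it handles only the case of one criterion and one set of parallel tests, it starts from the rounded intermediate values printed in the example (299.5, 141, 108.3, 95.6, and v_x = 1), and the helper `determinant` is written here for illustration.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(a[row][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for j in range(col, size):
                a[row][j] -= factor * a[col][j]
    return det

corr = [
    [1.00, 0.64, 0.66, 0.65],
    [0.64, 1.00, 0.88, 0.92],
    [0.66, 0.88, 1.00, 0.90],
    [0.65, 0.92, 0.90, 1.00],
]
sds = [21.0, 10.0, 9.0, 12.0]
n, k = 100, 3

D = determinant(corr) * math.prod(s * s for s in sds)
B = 441 * 299.5 - 3 * 141 ** 2       # rounded values as printed in the example
u_minus_w = 108.3 - 95.6
v_x = 1.0

L_mvc = D / (B * (u_minus_w + v_x) ** (k - 1))                 # equation 29, one set
L_vc = D / (B * u_minus_w ** (k - 1))                          # equation 40, one set
L_m = (u_minus_w / (u_minus_w + v_x)) ** (k - 1)               # equation 42, one set

chi_mvc, chi_vc, chi_m = (-n * math.log(value) for value in (L_mvc, L_vc, L_m))
print(round(L_mvc, 4), round(L_vc, 4), round(L_m, 4))
print(round(chi_mvc, 1), round(chi_vc, 1), round(chi_m, 1))
```

Each of the three chi-square values exceeds the 1 per cent point of Table 2 for its degrees of freedom (20.09 for 8 d.f., 16.81 for 6 d.f., and 9.21 for 2 d.f.), and the product L̄_m L̄_vc reproduces L̄_mvc, as equation 43 requires.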

13. Summary 

The statistic L_mvc given in equations 5 and 6 is used to test
simultaneously for equality of means, variances, and covariances. If they
are equal, the best estimate of each is given in equations 7, 8, and 9.

The statistic L_vc given in equations 10 and 11 is used to test
simultaneously for equality of variances and covariances. If these are equal,
the best estimate of each is given by equations 1 and 2.

The statistic L_m given in equation 12 is used to test for equality of
means (assuming equality of variances and covariances).

The tables of the 5 per cent and 1 per cent points are given in terms
of -N log_10 L_mvc, -N log_10 L_vc, and -N(k - 1) log_10 L_m. If the
value computed from the data is greater than the one found in the table,
the tests cannot be regarded as parallel. If it is less than the one found
in the table, the indication is that the tests may be regarded as parallel.

If one or more of the three statistics L_mvc, L_vc, and L_m show a significant
difference, we must conclude that the tests are not strictly parallel.
There is only one combination of results that is impossible. It is not
possible that L_mvc indicates a non-significant difference when L_vc and
L_m each indicate a significant difference at the 1 per cent or 5 per cent
level. The tests can be regarded as parallel only when each of the three
statistics considered separately shows a non-significant difference.

The more general case of compound symmetry that includes equality
of validity coefficients and equality of correlations between two sets
of parallel tests was also presented. Equations for the three criteria
L̄_mvc, L̄_vc, and L̄_m are presented in equations 29, 40, and 42, respectively.

Problems

1. Three comparable forms of a test are given to 200 persons with the following
results:

M_1 = 54.0     M_2 = 55.5     M_3 = 56.6,
s_1 = 13.9     s_2 = 13.5     s_3 = 14.4,
r_12 = .90     r_13 = .88     r_23 = (illegible).

Do these data indicate that the tests are parallel?

2. The following table gives means, standard deviations, and correlations for four
of the subtests of the College Entrance Examination Board Comprehensive Mathematics
Test, Form WCM-1, April 1948. (These data were supplied by Mr. Richard
Pearson and Dr. Ledyard Tucker of the Educational Testing Service.)

                       Subtest 3   Subtest 4   Subtest 5   Subtest 6
Subtest 3                            .7350       .5983       .6203
Subtest 4                .7350                   .6049       .6357
Subtest 5                .5983       .6049                   .5515
Subtest 6                .6203       .6357       .5515
Means                   9.0983      9.5817      3.0417      2.4483
Standard deviations     3.9411      4.4291      1.7569      1.9330

(The foregoing data are based on a sample of 600.)

(a) Can these four subtests be regarded as parallel tests?

(b) Would additive adjustments to equate the means be sufficient to make the
tests parallel?

(c) Can subtests 3 and 4 be regarded as parallel tests with respect to means?
With respect to variances and covariances? With respect to all three?

(d) Can subtests 5 and 6 be regarded as parallel tests with respect to means?
With respect to variances and covariances? With respect to all three?

3. The following table gives means, variances, and covariances for various grades
on the College Entrance Examination Board English Composition Test, December
1946. (These data were supplied by Dr. Ledyard Tucker of the Educational Testing
Service.)

              A          B          C          D          E
A          25.0704    12.4363    11.7257    20.7510    20.9425
B          12.4363    28.2021     9.2281    11.9732    23.4544
C          11.7257     9.2281    22.7390    12.0692    18.0384
D          20.7510    11.9732    12.0692    21.8707    19.8371
E          20.9425    23.4544    18.0384    19.8371    77.8976
Means      14.9048    15.4841    14.4444    14.3810    28.0550

(N = 126)

A = reader’s grade on original theme, question 1.

B = a different reader’s grade on a hand copy of the original theme, question 1.
(The second reader would not know that the theme had been read before.)

C = carbon copy of B. (With this copy the reader would know that he was reading
a theme already graded by someone else as a check on the accuracy of reading.)

D = table leader’s check on the grade assigned in A. He might either let the grade
stand, or alter it as a result of his check reading.

E = sum of reader’s scores on questions 2 and 3.

(a) Can the four grades assigned to question 1 (A, B, C, and D) be regarded as
parallel grades?

(b) Can the three grades that were assigned independently (that is, without knowledge
of previous grades), A, B, and C, be regarded as parallel grades?

(c) Can A and B be regarded as parallel grades?

(d) Can C and D be regarded as parallel grades?

(e) From these results, what conclusions can be drawn regarding the precautions
necessary in checking on the reliability of reading English themes?

(f) Can B, C, and D be regarded as parallel tests for predicting a criterion E?

(g) Can A, B, and C be regarded as parallel tests for predicting criterion E?

4. Given the following table showing means, standard deviations, and
intercorrelations for a criterion y and three tests x_1, x_2, and x_3, on a group of
50 persons, can the three tests be regarded as parallel tests for the purpose of
predicting the criterion y?

                          y       x_1      x_2      x_3
y                       1.00      .52      .56      .53
x_1                      .52     1.00      .94      .91
x_2                      .56      .94     1.00      .89
x_3                      .53      .91      .89     1.00
Means                   45.0     29.0     30.0     31.0
Standard deviations     24.0      9.0      8.0      7.0

Experimental Methods
of Obtaining Test Reliability


1. Introduction

In previous chapters (see Chapters 2, 3, 6, and 8) reliability was
defined as the “correlation between parallel tests.” In Chapters 2
and 3 a definition of parallel tests was given, in terms of equality of
means, standard deviations, and intercorrelations. Chapter 14 presented
a statistical test for equality of a set of means, a set of variances,
and a set of covariances. In this chapter we shall consider the different
possible ways of obtaining parallel test scores.

The term reliability was introduced by Spearman in his basic papers
on test theory; see Spearman (1904a), (1904b), (1907), (1910), and
(1913). Since then there have been many discussions of the various
factors influencing reliability in relation to the different methods of
measuring reliability. For an introduction to these discussions see, for
example, Kelley (1921), Muenzinger (1927), Symonds (1928), Anastasi
(1934), Adams (1936), Kuder and Richardson (1937), Kelley (1942),
Guttman (1945), Cronbach (1947), and Thorndike (1947). There are
many different ways of classifying the factors influencing reliability and
the methods of measuring reliability. Here we shall consider the following
major methods.

The use of parallel forms. 

Retesting with the same test form. 

Various split-half methods, such as first versus second halves, odd 
versus even items, and the method of matched random subtests 
(either halves or thirds). 

Recently methods of assessing test reliability or homogeneity have
been devised that do not make use of correlation of parallel scores.
Instead, these methods use item analysis data to assess the homogeneity
of the group of items in the test. One of these methods will be
considered in the next chapter.

Although the error of measurement, discussed in Chapters 2, 3, 4,
and 5, is a more basic concept in test theory than the reliability
coefficient, it has become customary during the last forty years to assess
tests in terms of the reliability coefficient rather than in terms of the
error of measurement. Since there are advantages and disadvantages
for each of these measures, it is urged here that both must always be given
in order to make possible a complete assessment of any test. Otis and
Knollin (1921) pointed out that the error of measurement is superior to
the reliability coefficient, in that it does not vary with changes in the
heterogeneity of the group. This property of the error of measurement
and its effect on the reliability coefficient were discussed in Chapter 10.
Kelley (1921) and Franzen and Derryberry (1932a) indicated that,
although the error of measurement did not vary with group heterogeneity,
nevertheless the unit in which the error of measurement was expressed
did vary from one test to another. They suggested several
ways of overcoming this disadvantage. Lincoln (1932) and (1933)
pointed out that reliability could be very high even when the differences
between two sets of measures were very large. This point was also
amplified and clarified by Ackerson (1933).

The tests or subtests that are correlated to determine test reliability 
should be parallel both in the sense that they satisfy the statistical 
criteria for parallel tests presented in Chapter 14 and in the sense that 
the items appear to require the same psychological processes and the 
same type of learning on the part of the subjects. This latter criterion 
depends on the judgment of the test technician and the subject matter 
expert, and it will be different for each different type of aptitude and 
achievement test. We shall consider here only general methods of 
setting up parallel tests or subtests, which are common to all types of 
material. 

2. Use of parallel forms 

For most sorts of situations, it will be found that the best method of 
obtaining a test reliability is to construct parallel forms of the test and 
administer them on different days to the same group of subjects. The 
method usually used would be to construct two parallel forms for this 
purpose. However, from the discussion presented in Chapter 14, we 
see that with three parallel forms it is possible to make a more complete 
check and to be certain that the forms are parallel, not only with respect 
to means and variances but also with respect to correlations. 

There is only one situation in which the use of parallel forms administered
on different days is not advisable. This is when the ability
that is being tested changes markedly in the interval between the tests.
For example, if we wish to determine the reliability of a typewriting
test by administering one form to a group on Monday and another form
on Friday, the method would not work if the group was practicing (and
hence improving rapidly in typewriting ability) during the intervening
time. Likewise the method is not good if the first test is given when
the subjects are in excellent “form” and the second test is given when
the subjects’ ability has decreased, for lack of practice during the
intervening week.

The same sort of consideration applies, for example, to any test of
physical fitness or muscular skill. The two administrations of the test
cannot be used to estimate the reliability of the test if there is good
reason for believing that the subjects have either improved or declined
in the ability that is being tested.

For most tests of scholastic achievement and mental ability, it is
reasonably easy to be sure that the subjects have not actually changed
markedly during the period intervening between two tests. For other
types of performance, of which athletic skills of various types are a
good example, it is very difficult to maintain a group at a state of uniform
excellence. The skill is likely to deteriorate with lack of practice,
and with practice the person may either improve or “go stale.”
In such cases all the “error of measurement” cannot be attributed to
the test. Much of what shows up in the statistical check as error of
measurement is actually true variation in ability. However, from
another point of view we must perhaps recognize that measurement of
some skills is extremely unreliable (regardless of the cause of this
unreliability); hence in using any such measures we must for many purposes
treat them just as we would treat very unreliable measures.

However, if we are dealing with a period of time during which the 
ability measured will not change systematically for different members 
of the group, and are dealing with a group of subjects under conditions 
such that it is not likely that the ability will change, the use of different 
forms of the test is the most realistic method of indicating reliability. 

It should be noted that as tests are actually used, if several forms of 
a test are available, we are likely to use any of the forms somewhat in¬ 
differently. Likewise, if we are testing a freshman class, the test is 
likely to come on different days in different institutions, or in different 
years. We can thus see that any form of the test may be given, and it 
may be given on any day, so that variability introduced by change of 
form and change of day would normally enter into the error of measure¬ 
ment of a test. 

It should also be pointed out that the error possibilities noted above 
can be easily detected. If the group has improved or deteriorated, the
mean will be higher or lower the second time. If some persons have
improved and others deteriorated, the standard deviation will in all 
likelihood have changed. A complicated set of influences in which some 
persons improve and others deteriorate in such a way that the mean and 
standard deviation of the group remain the same is a possibility, but 
would doubtless be very rare. 

In summary, the method of testing with parallel forms given several 
days apart is a method that allows the relevant sources of error to influ¬ 
ence the reliability coefficient. If the statistical tests for equal means 
and standard deviations are used, and satisfied, the method is one that 
may be used routinely with relatively little fear that undetected and 
irrelevant factors are rendering the obtained reliability coefficient either 
spuriously high or spuriously low. 
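
The routine check summarized above can be sketched numerically. This is only a descriptive illustration, not the statistical tests of Chapter 14: it reports the change in mean, the ratio of standard deviations, and the correlation between the two forms. The function names and the score lists are hypothetical.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def parallel_forms_summary(form_a, form_b):
    """Descriptive check of two forms given to the same group on different days."""
    return {
        "mean_diff": mean(form_b) - mean(form_a),         # near 0 if no gain or loss
        "sd_ratio": pstdev(form_b) / pstdev(form_a),      # near 1 if spread unchanged
        "reliability": pearson_r(form_a, form_b),         # parallel-forms coefficient
    }

# Hypothetical scores for eight subjects (assumed data, for illustration only).
form_a = [12, 15, 9, 20, 17, 11, 14, 18]
form_b = [13, 14, 10, 19, 18, 10, 15, 17]
summary = parallel_forms_summary(form_a, form_b)
print(summary)
```

A marked shift in the mean or in the standard deviation would suggest that the correlation should not be read as a reliability coefficient without further examination.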

It should be noted, since speed tests will enter prominently into some 
of the later discussion, that the parallel forms method is valid for speed 
tests. A speed test is a test composed of very easy items—items so 
simple that everyone could answer them if given time. For example, a 
set of two-digit additions given to eighth-grade students would approach 
being a “speed test.” If we are to get a good range of scores on such a 
test, it is necessary to have a large number of items, and to set a time 
limit so short that only the best people in the class finish, if at all. In 
such a test, practice effect from one time to the next is important. Un¬ 
less such conditions as amount of practice and use of “fore exercise” 
were very carefully standardized, it would not be possible to have the 
mean and variance of the parallel forms the same for the group. How¬ 
ever, if means and variances are the same one can be reasonably certain 
that the intercorrelation between the two parallel forms is a reasonable 
approximation to the reliability coefficient that the test should have. 

A parallel form reliability may also be secured by administering both 
forms at the same session. In some tests there may again be a marked 
difference of performance due to the fact that the giving of the first test 
influenced the second test. For example, if it is a speed test of two-digit 
additions, it is likely that for many persons, particularly the poorer 
ones, the score on the second test will be much better because of the 
practice on the first test. Of course this could easily be detected in the 
results because the mean would be larger on the second form. There 
are also other tests for which the performance on the second form is 
likely to be much worse than the performance on the first form. Any 
test that is fatiguing to the subjects would clearly fall in this category, 
and again such fatigue could easily be detected from the results. The 
average would be lower for the second test than for the first. 

If the foregoing rather obvious and easily detected difficulties were
not present, the major difficulty with reliability obtained by the successive
administration of parallel forms is that it is too high. This is
because there is no possibility for the variation due to normal daily
variability to lower the correlation between parallel forms. Woodrow
(1932) in his study of quotidian variability gathered evidence to show 
that there are day-to-day variations in test performance. 

Several other writers have pointed out that sometimes a low correla¬ 
tion between two parallel forms of a test indicates that the test is an 
unstable measure of a stable trait; at other times such a low correlation 
may arise from a stable measurement of an unstable trait. Instability 
in either the test or the trait would result in a low correlation between 
parallel forms. Methods of determining the instability of a trait as 
distinguished from the instability of a test have been suggested by
Paulsen (1931), Thouless (1936, 1939), Preston (1940), and Jackson
and Ferguson (1941). We can conclude then that, if parallel forms
of a test are given on the same day and if the statistical criterion for 
parallel forms is satisfied—namely, equal means and standard devia¬ 
tions—the reliability obtained is likely to be higher than that which 
would be obtained if day-to-day variability had also been allowed to 
affect the reliability. 

Generally speaking then, the use of two or three parallel forms admin¬ 
istered on different days is the best method of determining reliability of 
a test. However, since several parallel forms are frequently not avail¬ 
able, and since it is sometimes difficult to secure cooperation from sub¬ 
jects for an extended period, we shall consider the possibilities of ob¬ 
taining an indication of reliability when only one form of a test is 
available. 

3. Retesting with the same form 

Sometimes, when two parallel forms of a test are not available, it is 
possible to get an estimate of the reliability by administering the same 
test twice. Usually it is preferable to do this at rather widely separated 
times. Again with this method we should watch out for a practice or a 
fatigue effect that would be readily detectable in most instances by 
observing the distributions of test scores for the first and second admin¬ 
istrations. Aside from such an effect, the major danger in such a tech¬ 
nique is that the reliability will be too high because there will be a 
tendency for the subject to duplicate his former performance. That 
is, if the subject does not know the answer to an item, but makes a 
lucky guess and gets it right, he is likely to make the same guess next 
time and again secure credit for the item which he really does not know. 
Likewise, if he makes some minor mistake, and as a result answers incorrectly
an item that he normally would answer correctly, he is more
likely to repeat this performance when the same test is regiven. Such 
an effect could not occur if the person were taking a parallel form that 
would not contain the same items. In other words, the performance on 
a repetition of a test is likely to be much closer to the original score 
than the performance on a parallel form of the same test. This method 
of repetition of the same test at a different time should in general not 
be used, since it will give a spuriously high coefficient, and the amount 
of error is not easy to determine. 

The major exception is probably some simple perceptual discrimina¬ 
tions for which parallel forms cannot be devised. For example, a test of 
pitch discrimination or a test of auditory threshold for different pure 
tones can probably be regiven without such an effect. The person sim¬ 
ply judges each time whether he hears a tone or whether he does not 
hear a tone. In such a test there does not seem to be a ready way in 
which the person could spuriously duplicate his errors and successes of 
the previous set of trials. However, even in such simple tasks, it is fre¬ 
quently desirable to devise several different measuring techniques and 
correlate them, as well as to get the reliability for a repeat test by the 
use of each method. In general we may say that, even where it seems 
that a repetition of the same form is all that can be done, it is well for 
the test constructor to use some ingenuity and to get at the given factor 
in several different ways that he believes are roughly comparable, and 
then to see how well the different tests agree. New light will frequently 
be cast on the function being measured in this way. See, for example, 
the tests of auditory discrimination used by Karlin (1942). Studies of 
performance on retesting with the same form have been made by Woodrow
(1932), Jackson and Ferguson (1941), and Greene (1943).

4. General considerations in split-half methods 

Usually, when only one form of a test is available, reliability is deter¬ 
mined by a “split-half” method. This means that the items of the one 
form are divided into two forms, each with half the number of items of 
the original form. Typically, the subjects do not know that the test is 
to be scored in two parts, and do not know which items are in which of 
the halves. The experimenter need not, and frequently does not, decide 
how the items are to be divided until he sees the test results. However, 
from the viewpoint of setting up efficient scoring procedures, it is desir¬ 
able to decide on the division into two subtests before the test is set up 
for printing. 

The methods discussed in previous sections (either the parallel forms
method or retesting with the same form) provided the experimenter
with two scores. In such a case the reliability is given directly by the
Pearson product moment correlation between the two scores. A slightly 
modified method is necessary when reliability is to be obtained from two 
subtest scores obtained from a single test. One method is to correlate
the two half scores, and then substitute this correlation in the
Spearman-Brown formula for double length (formula 30, Chapter 6). We may
write

$$r'_{xx} = \frac{2r_{12}}{1 + r_{12}}, \tag{1}$$

where $r'_{xx}$ designates the reliability of the total test as estimated by
correcting the split-half correlation to double length, and
$r_{12}$ designates the correlation between the two halves of the test.
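
As a small numerical illustration of equation 1, the following sketch computes the corrected split-half reliability from a given correlation between halves. The function name is hypothetical, and the value .60 is an assumed example, not a result from the text.

```python
def spearman_brown_double(r12):
    """Equation 1: reliability of the full test from the half-test correlation."""
    return 2 * r12 / (1 + r12)

# Assumed example: the two halves correlate .60.
r_half = 0.60
r_full = spearman_brown_double(r_half)
print(round(r_full, 2))
```

For positive correlations below unity the corrected value is always larger than the correlation between the halves, since the full test is twice as long as either half.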


Another method of obtaining the reliability of the total test from
information contained in two subtest scores is to use the formula
presented by Rulon (1939),

$$r''_{xx} = 1 - \frac{s_d^2}{s_x^2}, \tag{2}$$

where $s_d^2$ is the variance of $x_1 - x_2$, the difference of scores on the two
halves of the test,
$s_x^2$ is the variance of scores on the total test, the sum of the scores
on the two halves of the test ($x = x_1 + x_2$), and
$r''_{xx}$ is used to designate the test reliability as given by equation 2.

Flanagan (1937b) has suggested that the use of this formula in conjunction
with a test-scoring machine provides a rapid and efficient method
of obtaining test reliability.

If it is easier to calculate the variance of $x_1$ and the variance of $x_2$
than it is to calculate the variance of the difference ($x_1 - x_2$), $r''_{xx}$ may
be written as

$$r''_{xx} = 2\left(1 - \frac{s_{x_1}^2 + s_{x_2}^2}{s_x^2}\right), \tag{3}$$

where $s_{x_1}^2$ is the variance of the $x_1$ subtest scores,
$s_{x_2}^2$ is the variance of the $x_2$ subtest scores, and the other terms
are as defined in equation 2.


Guttman (1945) derived this equation as a lower bound ($L_4$). He
showed that, under the assumption that $s_1 = s_2$, this formula is identical
with equation 1. Guttman also points out that, since this formula
gives a lower bound, it may be that in some cases the reliability coefficient
of a test has been underestimated, and that this fact may explain
why correlations corrected for attenuation are sometimes above unity.

In order to prove that equation 2 is equal to equation 3 we note that
the variance of a sum is equal to $s_1^2 + s_2^2 + 2r_{12}s_1s_2$, and that the
variance of a difference is equal to $s_1^2 + s_2^2 - 2r_{12}s_1s_2$, where $s_1$ and
$s_2$ are the standard deviations of the two halves. Substituting
these expressions in equation 2 gives

$$r''_{xx} = 1 - \frac{s_1^2 + s_2^2 - 2r_{12}s_1s_2}{s_1^2 + s_2^2 + 2r_{12}s_1s_2}. \tag{4}$$

Putting this over a least common denominator and simplifying, we have

$$r''_{xx} = \frac{4r_{12}s_1s_2}{s_1^2 + s_2^2 + 2r_{12}s_1s_2}. \tag{5}$$

If we put equation 3 over a common denominator in the same way we
obtain

$$r''_{xx} = 2\left[\frac{2r_{12}s_1s_2}{s_1^2 + s_2^2 + 2r_{12}s_1s_2}\right], \tag{6}$$

which is identical with equation 5 derived from equation 2; hence the
two formulas (2 and 3) for $r''_{xx}$ are identical.
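
The identity of the two formulas can also be checked numerically. The sketch below computes the reliability once from the variance of the difference scores (the form credited to Rulon) and once from the separate half variances (the form credited to Guttman). The function names and the half scores are hypothetical, and population variances (dividing by N) are assumed throughout.

```python
from statistics import pvariance

def reliability_from_differences(half1, half2):
    """Difference form: 1 - var(x1 - x2) / var(x1 + x2)."""
    diffs = [a - b for a, b in zip(half1, half2)]
    totals = [a + b for a, b in zip(half1, half2)]
    return 1 - pvariance(diffs) / pvariance(totals)

def reliability_from_half_variances(half1, half2):
    """Half-variance form: 2 * (1 - (var(x1) + var(x2)) / var(x1 + x2))."""
    totals = [a + b for a, b in zip(half1, half2)]
    return 2 * (1 - (pvariance(half1) + pvariance(half2)) / pvariance(totals))

# Hypothetical half scores for five subjects (assumed data).
half1 = [5, 7, 3, 9, 6]
half2 = [4, 8, 3, 8, 7]
r_diff = reliability_from_differences(half1, half2)
r_var = reliability_from_half_variances(half1, half2)
print(round(r_diff, 4), round(r_var, 4))  # both forms give 0.95 for these scores
```

The agreement holds for any pair of half scores, since the algebra above does not depend on the particular values.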

Rulon (1939) has shown that $r'_{xx}$ given by equation 1 and $r''_{xx}$ given
by equations 2 or 3 are identical if the standard deviation of subtest 1
is equal to the standard deviation of subtest 2. It may also be shown
that, whenever $s_1 \neq s_2$, $r''_{xx} < r'_{xx}$. If we divide the numerator and
denominator of equation 4 by $s_2^2$, and write $h$ for the ratio $s_1/s_2$, we
have

$$r''_{xx} = 1 - \frac{h^2 + 1 - 2r_{12}h}{h^2 + 1 + 2r_{12}h}. \tag{7}$$

By taking the derivative of $r''_{xx}$ with respect to $h$, and setting it equal
to zero, we can show that, for all positive values of $h$ (and a positive
$r_{12}$), $r''_{xx}$ is a maximum if $h = 1$. For students not acquainted with the
calculus, we may indicate the condition for a maximum value of $r''_{xx}$ by
the following algebraic transformations. If we put equation 7 over a
common denominator, simplify, and divide the numerator and denominator
by $h$, we obtain

$$r''_{xx} = \frac{4r_{12}}{h + \dfrac{1}{h} + 2r_{12}}. \tag{8}$$

By adding and subtracting 2 in the denominator, we may write

$$r''_{xx} = \frac{4r_{12}}{\left(\sqrt{h} - \dfrac{1}{\sqrt{h}}\right)^2 + 2 + 2r_{12}}. \tag{9}$$

We can see by inspection that, if $h = 1$, $r''_{xx} = r'_{xx}$; and, if $h$ has any
positive value other than unity, $r''_{xx} < r'_{xx}$. Since the ratio of two
standard deviations is always positive, it follows that $r'_{xx} \geq r''_{xx}$.
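
The dependence on the ratio of the two standard deviations can be tabulated directly. The following sketch, with hypothetical function names and an assumed half-test correlation of .60, evaluates the form $4r_{12}/(h + 1/h + 2r_{12})$ for several values of $h$ and compares it with the Spearman-Brown value $2r_{12}/(1 + r_{12})$.

```python
def spearman_brown_double(r12):
    """Corrected split-half reliability: 2 r12 / (1 + r12)."""
    return 2 * r12 / (1 + r12)

def reliability_from_ratio(r12, h):
    """Reliability as a function of h = s1 / s2 (h > 0): 4 r12 / (h + 1/h + 2 r12)."""
    return 4 * r12 / (h + 1 / h + 2 * r12)

r12 = 0.60  # assumed correlation between halves
for h in (0.5, 0.8, 1.0, 1.25, 2.0):
    print(h, round(reliability_from_ratio(r12, h), 4),
          round(spearman_brown_double(r12), 4))
```

The two values coincide only at $h = 1$; for any other positive ratio the first is the smaller, and a ratio $h$ gives the same result as its reciprocal $1/h$.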

The corrected split-half correlation (as indicated in equation
1) is identical with the reliability as computed by equations
2 or 3 if the variances of the two halves are equal. If the
variances of the two halves are unequal, the corrected split-half
estimate of reliability will be larger than the value given
by equations 2 or 3.

It should also be pointed out that the statistical tests of Chapter 14
make it desirable not to use the correlation between two subtest scores
for the estimation of reliability, but to divide the total test into three or
possibly four parts, and to test the similarity of these parts as well as
to obtain the correlation between them. These correlations can then
be used in the generalized Spearman-Brown formula (see equation 10,
Chapter 8), with K set equal to three or to four, and the reliability of
the total test estimated. By using this method we know that we are
using a correlation between parallel subtests as the basis for obtaining 
reliability. This means that the reliability found will not be too low 
because non-parallel subtests were chosen as the basis for estimating 
reliability. It is interesting to note that the use of more than two sub¬ 
tests in determining reliability has been suggested by Cureton (1931), 
Dunlap (1933), and Stephenson (1934). 
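
The step from part correlations to total-test reliability can be sketched as follows. The generalized Spearman-Brown formula is taken here in its usual form, $Kr/[1 + (K - 1)r]$; averaging the intercorrelations of the parts before substituting is an assumption made for this illustration, and the function name and correlation values are hypothetical.

```python
def spearman_brown(r, k):
    """Generalized Spearman-Brown: reliability of a test k times as long."""
    return k * r / (1 + (k - 1) * r)

# Assumed intercorrelations among three parallel thirds of a test.
part_correlations = [0.52, 0.48, 0.50]
mean_r = sum(part_correlations) / len(part_correlations)
reliability = spearman_brown(mean_r, 3)
print(round(reliability, 3))
```

With K equal to two the function reduces to the double-length formula used for split halves.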

The major problem in using subtest scores for the purpose of estimat¬ 
ing reliability is dividing the original test into equivalent subtests. We 
shall next consider some of the methods of dividing a test into subtests, 
and the advantages and disadvantages of each. 

5. Successive halves or thirds 

Dividing a test into comparable halves or thirds is not a simple mat¬ 
ter. For example, the easiest way to divide the test is to take the first 
half of the test against the second half of the test. Often such a method 
will not result in parallel tests at all. For example, if the test is given in 
one session and is a timed test, any items that are not answered for lack 
of time will be in the second half of the test. The score on the second 
half will be lower than on the first half. For a speed test composed of 
easy items the results of plotting score on first half against score on 
second half are very peculiar. All subjects who did not reach the second 
half would score zero on it, regardless of what their score was on the first 
half. If the test is a pure speed test, in the sense that the vast majority 
of subjects get the item correct if they try it, so that the only errors are
“items not yet attempted,” everyone who finishes the first half gets a
perfect or near-perfect score on it, regardless of his score on the second
half. Figure 1 is such a scatter diagram. Clearly any correlation worked
on such a diagram could not be interpreted as a reliability coefficient.
Probably such a pure case will rarely be found. But wherever the score
is in large part determined by the fact that time is called before many 
subjects have finished, this situation will be approximated, and the 
first versus the second half will not be “comparable halves” suitable for 
obtaining an estimate of the reliability coefficient. 



Figure 1. Showing the relationship between scores on the first and last halves of a
test for a pure speed test. 
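
The shape of such a scatter diagram can be reproduced with a short sketch. It assumes an idealized pure speed test of twenty items in which every attempted item is answered correctly; the function name and the test length are hypothetical.

```python
def half_scores(items_attempted, n_items=20):
    """First-half and second-half scores when every attempted item is correct."""
    half = n_items // 2
    first = min(items_attempted, half)
    second = max(items_attempted - half, 0)
    return first, second

# One point per possible number of items attempted, 0 through 20.
points = [half_scores(n) for n in range(0, 21)]
for first, second in points:
    print(first, second)
```

Every point lies either on the line where the second-half score is zero or on the line where the first-half score is perfect, so the plot has an L shape rather than the elliptical form expected of two parallel measures.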

It might be thought that, if all subjects finished two-thirds of the 
test, we could correlate the first third with the second third of the test, 
and correct this coefficient to triple length. However, such a method 
is valid only if the last third is parallel to the two matching halves 
secured from the first two-thirds. If the difficult items are at the end 
of the test, it is impossible to make any plausible guesses regarding what
would happen if the time limit were increased so that everyone could
finish the test. Furthermore, such a method does not give the reliability
of the test with the shorter time limit. It estimates what the reliability
might be if the time limit were such that practically everyone finished
the test. If the time limit is important, we must use a parallel form
method of estimating reliability. If the time limit is generous so that
most subjects finish the test, it may be possible to estimate reliability 
from subtest scores. 

In addition to the problem of time limits on a test, the problem of 
item difficulty must also be considered. Many tests are constructed 
with the easy items first, the items of average difficulty next, and the 
most difficult items at the end of the test. Clearly, if the items in the 
test are in difficulty order, the first and second halves will not be com¬ 
parable halves. 

It can be seen that, if a test contains a number of items of average 
difficulty, and is then lengthened by adding more very difficult items,
the reliability of the test will decrease, despite increased test length and 
the increased testing time. The new added items will be answered on 
a chance basis by most of the persons; hence it will be a matter of acci¬ 
dent whether they get the new items right or wrong. As a larger num¬ 
ber of very difficult items are added, a larger component of the score 
will be due to guessing, and this component will decrease the reliability 
of the score on the augmented test. This in no way contradicts the 
Spearman-Brown formulation on the relation of test length to test re¬ 
liability, since in this formulation it was assumed that the new set of 
items was parallel to the old ones. This means that the items were of 
similar mean, standard deviation, and reliability. The new items sup¬ 
posedly added here would be difficult items with a lower mean; and, 
since they would be answered on the basis of chance, the reliability of 
this new portion and its correlation with the easier part of the test would 
each be near zero. 
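
The argument can be made concrete under a simplifying assumption that goes slightly beyond the text: suppose the score on the added items is pure chance, so that it is uncorrelated both with the original part and with the corresponding chance score on a parallel form. The covariance between two parallel augmented forms is then the reliable variance of the original part alone, while the total variance grows by the chance variance. The function name and the numerical values below are hypothetical.

```python
def augmented_reliability(r_aa, var_a, var_chance):
    """Reliability after adding a pure-chance component to the score.

    r_aa       : reliability of the original part
    var_a      : variance of the original part
    var_chance : variance contributed by the chance-answered items
    Assumes the chance component is uncorrelated with everything else.
    """
    return r_aa * var_a / (var_a + var_chance)

# Assumed values: original reliability .90, variance 25, added chance variance 5.
print(round(augmented_reliability(0.90, 25.0, 5.0), 2))
```

Any positive chance variance lowers the coefficient, and the loss grows as more chance-answered items are added, which is the effect described above.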

From considerations such as these, we see that the effect of increasing 
the time limit on a test is difficult to predict. Increasing the time limit 
will permit subjects to answer more items; hence it may be thought of 
as increasing the effective length of a test. However, many of the sub¬ 
jects will not know the answers to the more difficult items at the end of 
the test; hence they will guess about these items and add a chance in¬ 
crement to their score. This increment will not remain stable from 
form to form; hence it will lower the reliability of the test. 

If we wish to use the first and second halves (or the successive thirds) 
of a test for computing reliability, it is possible to plan the test to over¬ 
come both the problems raised by time limits and by the difficulty 
ordering of items. For the first versus the second halves method, for 
example, we arrange the test items so that the item difficulty range in 
the first half of the test is duplicated in the second half of the test.
Then, if sufficient time is given so that everyone, or practically everyone,
has a chance to finish the test, the first and second halves will be
comparable unless there is either a practice or a fatigue effect as a sub¬ 
ject goes through the test. If a test is given in two sessions, with time 
out between for rest and relaxation, if item difficulty is equated between 
the two sessions, and comparable time allowances are given for each 
session, it is probable that a good estimate of reliability can be obtained 
by correlating the first session with the second. 

For example, the University of Chicago comprehensive examinations 
are six-hour examinations given in two three-hour sessions, with two 
hours elapsing between sessions. It was found feasible to construct the 
examinations so that the subjects or topics covered in the morning ses¬ 
sion were dealt with again in the afternoon session, and consequently 
the morning and afternoon sessions were roughly comparable in length, 
in item difficulty (as judged by the committee constructing the examina¬ 
tion), and in topics covered. When the examinations were set up in this 
way, the correlation between morning and afternoon sessions was found 
to be about the same as for other selections of comparable halves. 

Likewise, for radio code receiving tests, use has been made of a corre¬ 
lation between errors on first versus second halves of the test. In this 
case the material received by the students is of about comparable diffi¬ 
culty all the way through, since the same characters (letters of the 
alphabet) are used, and they are sent at the same speed throughout the 
test. Also there is no question of a differential time limit. All subjects 
must keep up with the rate of sending, or skip characters and start 
afresh at all times. Furthermore, these particular tests are short, only 
about three minutes in length, so that there is relatively little oppor¬ 
tunity for a fatigue effect. The tests are preceded by about ten minutes 
of warming-up practice by the students so that there is probably little 
consistent improvement from the early to the later parts of the test. 
It should be noted that for these radio code receiving tests the best 
method of testing reliability is provided by a parallel form, but it should 
be given very shortly after the first form in order to avoid the effects of 
practice. The method of using errors in odd versus even words would in 
this case result in a spuriously high reliability, since the making of an 
error in one word is likely to throw the student off a little, and he is 
also likely to miss the next word or two before he gets “back into stride” 
again. Thus it is likely that the correlation between errors made on 
odd- versus even-numbered words would be considerably higher (and
spuriously higher) than the correlation between parallel forms of the
test administered under comparable conditions with a relatively brief
period between the two tests. Also each test should be given with
suitable provision for a warming-up period that is not counted on the
test score. 

6. Odd versus even items or every nth item 

By far the commonest form of comparable halves is the odd-even 
items division. It is probable that this method never gives too low a 
value for the reliability coefficient. If there is an error it is always in 
the direction of a reliability that is spuriously high. Sometimes, as will 
be shown, the odd-even reliability seriously overestimates the test re¬ 
liability as indicated by parallel forms. 

It can readily be seen that, if the items are in difficulty order, the odd 
items will have about the same average difficulty and spread of difficulty 
as the even items. If there is any bias, it is likely to be that the odd
items will be on the average very slightly easier than the even items. 

In using this method, however, we must be certain that there is no 
dependence of one item on another. For example, in the radio code 
tests just mentioned, success or failure on one item—particularly fail¬ 
ure—is very likely to influence the performance on the next item. In 
some tests we find a series of questions on a given topic, and it is some¬ 
times difficult to decide whether the items are independent, in the sense 
that knowing the answer depends primarily upon whether or not we 
have studied the topic or whether there is a spurious dependence, as in 
the case of errors in radio code. In performance tests, where the subject 
is to assemble or disassemble a mechanism, and is graded on the various 
steps, there is very likely to be a spurious relationship, in the sense that 
the subject learns or does not learn a certain set of acts as a unit while 
the examiner, in order to grade the performance objectively, sets up 
numerous rather artificial divisions. In such cases as these, it seems 
that the fair test to apply is: “Would you as an examination constructor 
set up such halves as separate tests?” In performance on assembly of 
apparatus, it is doubtful if the test constructor would want the students 
to go through the entire performance, as would be necessary, and grade 
them on only half the points that it was possible to observe. In a set 
of statements describing the characteristics of rods and cones in the 
eye, for example, it is rather likely that the test constructor would 
assent to using half the statements for a shortened form of the test. It 
might readily be, however, that the odd items would not constitute a 
satisfactory parallel form for the even items. The items should be in¬ 
spected to insure that the type of subject matter and the difficulty dis¬ 
tribution for one of the halves are roughly paralleled in the other half. 

Odd-even correlation is also spuriously high on a test with a rather 
stringent time limit so that a large number of subjects do not finish the
test. If a subject fails to answer the last ten items in the test, obviously
he “misses” all of them. Thus he gets five points more on his “odds” 
error score, and likewise five points more on his “even” error score. It 
is highly probable that careful observation will show that many of the 
published reliabilities are spuriously high because of this factor. Again 
this type of error can be strikingly illustrated in the “pure speed” test 
to which previous reference has been made. If every subject gets all 
items correct as far as he goes, one who finishes ten items will have an 
odds score of five and an evens score of five. If he finishes eleven items, 
he will have an evens score of five and an odds score of six, and then at 
twelve items the score will be six and six. That is, the odds and evens 
scores will either be identical, or the odds score will be one point greater 
than the evens score. The correlation scatter plot will appear as in 
Figure 2, and the correlation will be well over .99. Again such a pure 
case probably never occurs, but an approximation to it (coupled with a 
spuriously high reliability) occurs whenever the odd-even method is 
used, and not all the subjects finish the test. 

Figure 2. Showing the plot of odd versus even items for a pure speed test.
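
The size of the odd-even correlation in the pure speed case can be verified with a short computation. The sketch assumes one subject at each stopping point from 10 through 50 items and that every attempted item is correct; these figures and the function names are hypothetical.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def odd_even_scores(items_finished):
    """Odd and even scores when all finished items are answered correctly."""
    odds = (items_finished + 1) // 2   # items 1, 3, 5, ...
    evens = items_finished // 2        # items 2, 4, 6, ...
    return odds, evens

finished = list(range(10, 51))  # assumed: one subject per stopping point
odds = [odd_even_scores(n)[0] for n in finished]
evens = [odd_even_scores(n)[1] for n in finished]
r = pearson_r(odds, evens)
print(round(r, 4))
```

Because the two scores never differ by more than one point, the correlation reflects only how far each subject got, not how consistently he would get that far on another occasion.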

It should be noted that the odd-even reliability is probably still too 
high, even when the items are in order of difficulty, and all persons are
allowed to finish the test, and the items are independent of one another
(in the sense that making a mistake on one item does not of itself in¬ 
crease the probability of making a mistake on another item). Any 
variability due to day-to-day variations in ability is ruled out, and even 
the variation that might be caused by a slight practice or fatigue effect 
as we progress through the test is ruled out. If we use the parallel 
forms as a standard, the odd-even reliability, as generally applied to 
most tests, probably gives a result that is too high, owing to the careful 
control of various other sources of variation and also to the fact that 
most tests are timed tests having a fair proportion of the score depend¬ 
ent upon most of the subjects not getting a chance to try the last items. 
To the extent that a test is a speed test, the score depending on how
rapidly a subject works in the given time limit, there is no way of estimating
the reliability except by testing a second time with a parallel
form. 

There have been several studies comparing the odd-even with parallel 
form reliability and with the correlation between test and retest 
(with the same test). See Foran (1931), Jordan (1935), Goodenough 
(1936), Remmers and Whisler (1938), Ferguson (1941b), Greene (1943),
and Jackson and Ferguson (1941). These experiments show clearly 
that different methods of measuring reliability give different results. In 
general, the parallel form reliability is lowest, and the odd-even (cor¬ 
rected) is the highest. 

It might be thought that, if everyone finished two-thirds of the test, 
we could use an odd-even reliability on the first two-thirds, get the 
correlation between these two thirds, and then correct to triple length. 
However, this gives an estimate of the reliability of the total test on 
the assumption that everyone finishes the test. It does not give any esti¬ 
mate of the extent to which a given subject will hit the same speed rate 
on different administrations of the test; and hence will get to the same 
point in the test. There is no possible way to estimate this factor accurately
except by giving parallel forms with comparable time limits and
under standard directions, and then observing the extent to which the
score is the same.

7. Matched random subtests

If a single test score for each subject is to be used in estimating test 
reliability, it is necessary to regard this single score as divided into two, 
three, or four equivalent subtest scores. In the preceding sections we 
have seen that under certain conditions successive halves or thirds of a 
test can reasonably be regarded as parallel forms, whereas under other 
conditions the successive segments of a test are clearly not parallel
forms. Similarly, assigning every second or every third item to one of
two or three subtests is sometimes a good method for obtaining parallel 
subtests, and sometimes a poor method. 

If a test is composed of a large number of independent items and is 
administered with a liberal time limit, it can usually be divided into 
parallel subtests. If the test has only a few item groups in it, as, for 
example, in most mechanical assembly tests or in tests involving the 
writing of a paragraph in English, it may or may not be possible to de¬ 
vise a test that is composed of parallel subtests. If a single time limit 
that is short is used, there is no possibility at present of getting any 
valid estimate of the reliability of such a test by using subtest scores of 
any sort. 

If item analysis data are available on a test (that has a large number 
of independent items and a liberal time limit), items may be matched 
on such item analysis data and assigned to the subtests. This is an 
excellent method of insuring that the subtests will be parallel. For 
example, suppose that the percentage of persons answering the item
correctly (p) and a biserial or point biserial correlation with total test
score (r) are available for each item. The best procedure for constructing
parallel subtests is to represent each item by a point on a scatter
diagram, the abscissa of which represents p and the ordinate r. So that
the items can be identified later, each point should be numbered
with the item number, as shown in Figure 3. Items may then be
simultaneously matched on p and r, and a ring drawn around the
matched pairs, triples, or quadruples, as shown in the diagram. If the
test is heterogeneous with respect to item type or with respect to type
of subject matter covered, it is important to match items for subject
matter, item type, etc., as well as for p and r.

One member of each group should then be randomly assigned to a 
given subtest. For example, if only two subtests are to be formed, the 
assignment could be made by tossing a coin, and assigning the lower 
numbered item of the pair to form A if the coin showed heads and to
form B for tails. In constructing three parallel subtests, it is necessary 
to assign each triple of items to the three parallel subtests by a some¬ 
what more complicated procedure. For example, the items in each 
triple may be identified by item number, as low, medium, and high 
(L, M, and H). There are then six possible ways of assigning these 
three items, one to each of three subtests. Each such order may then 
be assigned a number from 1 to 6 (1 = LMH, 2 = LHM, etc.), and each 
triple assigned according to the throw of a die. 
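
The coin toss and die throw described above can be imitated with a pseudo-random shuffle. The following is only a minimal sketch: the function name `assign_matched_groups`, the `seed` argument, and the item numbers are hypothetical, and shuffling each matched group is assumed to be an acceptable stand-in for the physical coin or die.

```python
import random


def assign_matched_groups(groups, n_subtests, seed=None):
    """Assign one item from each matched group to each of n_subtests subtests.

    groups: iterable of tuples of item numbers, each of length n_subtests.
    Returns a list of n_subtests lists of item numbers.
    """
    rng = random.Random(seed)  # seed is only for a reproducible illustration
    subtests = [[] for _ in range(n_subtests)]
    for group in groups:
        if len(group) != n_subtests:
            raise ValueError("each matched group needs one item per subtest")
        order = sorted(group)   # low, (medium,) high by item number
        rng.shuffle(order)      # every ordering of the group is equally likely
        for subtest, item in zip(subtests, order):
            subtest.append(item)
    return subtests


# Hypothetical matched triples of item numbers, as ringed on the scatter diagram.
triples = [(4, 19, 33), (7, 12, 28), (2, 15, 41)]
form_a, form_b, form_c = assign_matched_groups(triples, 3, seed=1)
print(form_a, form_b, form_c)
```

For pairs, the shuffle has two equally likely outcomes (the coin); for triples, six (the die).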

If such item analysis information is available before the test is assembled,
scoring routines are much simplified if the items of one subtest
are put first, another second, etc., or else if items from the different
subtests are distributed successively through the test. For example, for
three subtests, A, B, and C, the items might be arranged ABCABC, etc. 



Figure 3. Showing how to construct three parallel subtests or tests by simulta¬ 
neously matching items on a difficulty and reliability index. 


It must again be emphasized that no matter which order of items is 
used, it is necessary to allow time for almost all students to finish almost 
all items. It is not possible to estimate test reliability from parallel 
subtests if the test score is markedly influenced by the time limit. 

An analogous method may also be used in attempting to build a 
second test to match a first one already in use. Figure 4 illustrates the 
use of such a procedure in developing an aptitude test for the U. S. Navy. 
In this case the items for form 1 were already in use when form 2 was 
constructed, so that the two forms could not be matched as well as if it 
had been possible to set up two forms simultaneously.

Figure 4. Selecting items for a second form of a test to match items previously
selected for a first form. Item analysis data used in the selection of items for
GCT Form 2 Analogies. The abscissa is p, the proportion passing the item.
[From Satter (1944), OSRD Report 3756; Applied Psychology
Panel, NDRC.]

In conclusion, it should be pointed out that a statistical criterion for 
parallel tests is now available. We should use it on the subtest scores to 
find out whether or not the precautions used in constructing the sub¬ 
tests resulted in parallel sets of scores. In order to make complete use 
of the methods of Chapter 14 in testing covariances as well as variances, 
it is necessary to have three subtest scores.

8. Reliability of essay examinations

In dealing with the reliability of essay examinations, we encounter 
certain special considerations that are not involved when determining 
the reliability of objective examinations. 

One major problem in essay examinations is the accuracy with which 
the same (or different) readers will grade the examinations. The usual 
method of checking on the accuracy of reading is to have two or three 
different readers assign a mark to the examination independently of each 
other. This means that the readers will agree beforehand on the different
marks to be assigned, and on the type of papers to which each
mark is to be given, and then each reader will record his marks on some
sort of master list of students. It is necessary to be certain that one
reader does not see any of the marks assigned by another reader, and 
that an earlier reader does not make any marks on the paper, since these 
could be seen by and perhaps influence a later reader. 

These marks, independently assigned to each of a set of papers by
different readers, are then correlated. This correlation between the
marks of two readers is known as “the reader reliability” of the exam¬ 
ination. It should be noted that the correlation between marks of two 
readers may be high, even though the means and standard deviations of 
the marks may be radically different. If the mean of reader A is higher 
than the mean of reader B for a given set of papers, it indicates that A 
is an “easier grader” than B. Such a difference can be adequately 
taken care of by adding a constant to each grade assigned by B, or sub¬ 
tracting a constant from each grade assigned by reader A. Correspond¬ 
ingly, if reader A has a larger standard deviation for his marks on a set 
of papers than reader B, this can be corrected by multiplying B’s marks 
by an appropriate constant, or dividing A’s marks by some constant. 
A difference in mean and standard deviation is not serious provided the 
correlation between the marks is high. That is, the papers need not be 
regraded by the readers, but it is essential, in order to have a fair mark¬ 
ing system, that the marks of the different readers be equated in mean 
and standard deviation before being used further. If the reader reli¬ 
ability is low, there is no way of equating marks; it is necessary for the 
readers to discuss their differences of opinion and to regrade the papers 
before the marks can be used. 
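
Equating two readers in mean and standard deviation amounts to a linear rescaling of one reader's marks. The sketch below is illustrative only: the function name `equate_marks` and the marks are hypothetical, and it assumes that reader B's marks are to be brought onto reader A's scale and that the reader reliability has already been judged high enough.

```python
from statistics import mean, pstdev


def equate_marks(marks_b, marks_a):
    """Linearly rescale reader B's marks to the mean and SD of reader A's marks."""
    mean_a, sd_a = mean(marks_a), pstdev(marks_a)
    mean_b, sd_b = mean(marks_b), pstdev(marks_b)
    if sd_b == 0:
        raise ValueError("reader B gave every paper the same mark")
    scale = sd_a / sd_b  # multiplicative constant applied to B's marks
    # Shift and stretch B's marks so they share A's mean and SD.
    return [mean_a + scale * (m - mean_b) for m in marks_b]


# Hypothetical marks given by two readers to the same five papers.
reader_a = [70, 80, 90, 85, 75]
reader_b = [55, 62, 70, 66, 58]
equated_b = equate_marks(reader_b, reader_a)
print([round(m, 1) for m in equated_b])
```

The rescaling leaves the correlation between the two readers unchanged, which is why it cannot repair a low reader reliability.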

A more precise method of handling the problem of comparability 
among readers is to use the methods of Chapter 14 to analyze the results 
of several different readers. The test with $L_{mvc}$ will indicate immediately
whether or not the means, variances, and correlations among the
set of readers may be regarded as identical. If $L_{mvc}$ is near unity,
we have only to inspect the magnitude of the correlations to see if they
are high enough to be satisfactory. In general, we should strive for a
reader reliability over .90 if it is possible to achieve this level. It would 
seem that a reader reliability of less than .80 is so low as to necessitate 
further discussion and alteration in methods of reading. It should be 
remembered that the reader reliability is an upper bound for the test 
reliability. If two readers looking at the same paper agree to the extent 
of .80, for example, then if different questions (parallel questions) on 
the same material were read by different readers, the agreement is prac¬ 
tically certain to be much less than .80. 

In order to make clear the method of comparing reliability of essay 
examinations with that of objective examinations, we shall consider 
what has been termed “content reliability” (see Gulliksen, 1936). In 
an objective examination scored with a key that has previously been 
agreed on by all persons concerned, the equivalent of “reader reliability” 
is unity. Any difference in scores between parallel tests is due to dif¬ 
ferences in sampling of subject matter content in the two tests, and to 
possible changes in the subject between the time of administration of 
the two tests. If two parallel essay examinations are matched just as 
successfully with respect to content as are two parallel objective exam¬ 
inations, the correlation between the two parallel essay forms will 
practically always be lower than between the two objective forms, 
owing to the fact that the unreliability of reading will still further lower 
the correlation between the two essay forms. In order to determine the 
extent to which the low reliability of an essay examination is due to 
poor agreement among readers or to poor matching of questions in 
parallel forms, it is necessary to determine the content reliability of the 
essay examination. 

For one form of an examination, let us use: 

$x'_1$ to indicate the score assigned by reader 1,
$x'_2$ to indicate the score assigned by reader 2, and
$x'_c$ to designate the correct score that the student should have received
on the content of his paper, if it had not been for reader
errors. It should be noted that $x'_c$ is not the “true score.” It
is comparable to the score on an objective examination and like
such a score has a true component, and an error due to the unreliability
of sampling, unreliability of student performance, etc.
$e'_1$ designates the error made by reader 1, and
$e'_2$ designates the error made by reader 2.

For the parallel form we shall use $x''_1$, $x''_2$, $x''_c$, $e''_1$, and $e''_2$, all defined
as above, but for the second form of the test. It is assumed that

$$x'_1 = x'_c + e'_1, \tag{10}$$

$$x'_2 = x'_c + e'_2, \tag{11}$$

$$x''_1 = x''_c + e''_1, \tag{12}$$

and

$$x''_2 = x''_c + e''_2. \tag{13}$$

It is possible to compute the correlations $r_{x'_1 x'_2}$ and $r_{x''_1 x''_2}$, which are
reader reliabilities, and also to compute the four correlations of the
form $r_{x'_a x''_b}$, where a = 1 or 2 and b = 1 or 2. These are test reliabilities,
attenuated by the inaccuracy of reading. The problem is to express
$r_{x'_c x''_c}$ (the content reliability of the test) as some function of the known
correlations.

First we may obtain the relationship of the reader reliability to the
variances of $x'_1$ and $x'_c$.


$$r_{x'_1 x'_2} = \frac{\sum x'_1 x'_2}{N s_{x'_1} s_{x'_2}}. \tag{14}$$

If we assume that the two readers are reading with equal accuracy, we
may substitute $s_{x'_1}$ for $s_{x'_2}$. Also for $x'_1$ and $x'_2$ let us substitute their
values as given in equations 10 and 11, obtaining

$$r_{x'_1 x'_2} = \frac{\sum (x'_c + e'_1)(x'_c + e'_2)}{N s^2_{x'_1}}. \tag{15}$$

If we expand the numerator and assume that the correlations $r_{e'_1 e'_2}$,
$r_{e'_1 x'_c}$, and $r_{e'_2 x'_c}$ are equal to zero, we have

$$r_{x'_1 x'_2} = \frac{\sum (x'_c)^2}{N s^2_{x'_1}}. \tag{16}$$

If we write $s^2_{x'_c}$ for $\sum (x'_c)^2 / N$ and take the square root, we have

$$\sqrt{r_{x'_1 x'_2}} = \frac{s_{x'_c}}{s_{x'_1}}. \tag{17}$$

By a similar procedure for the other test, and the other reader, we have

$$\sqrt{r_{x''_1 x''_2}} = \frac{s_{x''_c}}{s_{x''_2}}. \tag{18}$$


The correlation between the two forms is $r_{x'_a x''_b}$. There are four such
correlations for different values of a and b. Let us assume that all may
be regarded as equal, that is, that the tests are parallel and that the
error variance due to reader is equal for each reader and each form; then
one of the four correlations, say $r_{x'_1 x''_2}$, may be taken as typical of the
group. We have then

$$r_{x'_1 x''_2} = \frac{\sum x'_1 x''_2}{N s_{x'_1} s_{x''_2}}. \tag{19}$$

Substituting equations 10 and 13 in the numerator gives

$$r_{x'_1 x''_2} = \frac{\sum (x'_c + e'_1)(x''_c + e''_2)}{N s_{x'_1} s_{x''_2}}. \tag{20}$$

Expanding the numerator and noting that the correlations of reader
errors with each other and with test score ($x_c$) are zero, we have

$$r_{x'_1 x''_2} = \frac{\sum x'_c x''_c}{N s_{x'_1} s_{x''_2}}. \tag{21}$$

By the usual definition for correlation this becomes

$$r_{x'_1 x''_2} = \frac{r_{x'_c x''_c} \, s_{x'_c} s_{x''_c}}{s_{x'_1} s_{x''_2}}. \tag{22}$$

Substituting from equations 17 and 18, we have

$$r_{x'_1 x''_2} = r_{x'_c x''_c} \sqrt{r_{x'_1 x'_2}} \sqrt{r_{x''_1 x''_2}}. \tag{23}$$

Solving for $r_{x'_c x''_c}$, the content reliability, we have

$$r_{x'_c x''_c} = \frac{r_{x'_1 x''_2}}{\sqrt{r_{x'_1 x'_2} \, r_{x''_1 x''_2}}}. \tag{24}$$


It will be noted that equation 24 is identical in form with the correc¬ 
tion for attenuation, equation 21, Chapter 9. It gives the correction for 
the “attenuation due to inaccuracy of reading.” 


The reliability of an essay test corrected for attenuation due
to the inaccuracy of reading has been termed the content reliability
of the essay test. The content reliability is equal to
the correlation between parallel forms divided by the geometric
mean of the reader reliabilities of the two forms. (See equation 24.)
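
As a numerical illustration of equation 24, the correction can be computed directly. This is a minimal sketch; the function name `content_reliability` and the three correlations are hypothetical values chosen only to show the arithmetic.

```python
import math


def content_reliability(r_forms, r_readers_form1, r_readers_form2):
    """Equation 24: parallel-form correlation corrected for reader unreliability."""
    if r_readers_form1 <= 0 or r_readers_form2 <= 0:
        raise ValueError("reader reliabilities must be positive")
    # Divide by the geometric mean of the two reader reliabilities.
    return r_forms / math.sqrt(r_readers_form1 * r_readers_form2)


# Hypothetical values: the forms correlate .72 and the reader
# reliability is .90 on each form.
estimate = content_reliability(0.72, 0.90, 0.90)
print(round(estimate, 2))  # 0.8
```

Because the divisor cannot exceed unity, the content reliability is never lower than the observed correlation between the two forms.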


9. Summary 

Three main methods of determining reliability have been considered. 
1. Parallel forms. Generally speaking, this method is best, provided
that we can regulate the interval between the two tests and the activity
of the subjects during that interval so that the influence of practice,
fatigue, and other similar effects will be negligible. If three parallel 
forms are used, and the statistical criterion for parallel tests given in 
Chapter 14 is applied, score changes due to practice, fatigue, etc., are 
detected immediately and routinely. 

It should be especially noted that, if the score variance depends in 
any large part on unanswered items at the end of the test (for example, 
if speed is an important factor in test score), it is necessary to use the 
parallel forms method. Neither of the other two methods is satisfactory 
in this case. 

2. Retesting with the same test. This can be done particularly well
in such tests as sensory limen or discrimination tests, in which it is not
likely that the subject will remember and recognize the individual 
items. As with parallel forms, it is necessary for the experimenter to 
control both the length of time between tests and the activity of the 
subjects during the interval so as to rule out practice, fatigue, and sim¬ 
ilar effects. If the test has a distinct speed component, it is very un¬ 
likely that the same form can be repeated with no score change such as 
those produced by either practice or fatigue. However, if the statistical 
criterion for parallel tests is routinely applied, such score changes are 
detected immediately. 

3. Some variant of the split-half or parallel subtests method. If only
one form of the test is available, and it is not possible or desirable to 
repeat the test with the same group of subjects, it is possible to consider 
using one of these methods. Such methods cannot be used unless the 
test has a liberal time limit. It is also desirable, but not always essen¬ 
tial, that the test have a large number of independent items. If three 
or more parallel subtests are used, then again the criteria presented in 
Chapter 14 will show whether or not parallel subtests were obtained. In 
many instances, first versus second halves or odds versus evens will form 
satisfactory parallel subtests. The most certain method, however, is to 
match groups of items on statistical and other criteria available, and 
then to assign randomly each member of a group to a different parallel 
form. This matching and randomizing method gives excellent results. 
If information is available for such matching before the items are 
arranged in the test, it is possible to be certain that either the successive 
halves (thirds) of a test or the alternate items will be parallel subtests 
of representative items from the total test. 

In studies comparing these three methods of obtaining test reliability, 
it is generally found that the parallel forms correlation is the lowest and 
the “corrected odd-even” reliability is the highest. 

When either of the first two methods is used, the correlation between
two sets of scores is the reliability coefficient of the test. When the
split-half method is used, it is necessary to substitute the obtained correlation
($r_{12}$) in the formula

$$r'_{xx} = \frac{2 r_{12}}{1 + r_{12}} \tag{1}$$

in order to obtain the test reliability; or to obtain the variance of the
difference scores ($x_1 - x_2$) and use the formula

$$r''_{xx} = 1 - \frac{s_d^2}{s_x^2}, \tag{2}$$

or to obtain the variance of each half and use the formula

$$r''_{xx} = 2\left(1 - \frac{s_1^2 + s_2^2}{s_x^2}\right), \tag{3}$$

which is identical with equation 2.

It was shown that, if $s_1 = s_2$, then $r'_{xx} = r''_{xx}$; whereas, if $s_1 \neq s_2$,
then $r'_{xx} > r''_{xx}$. If three parallel subtests are used, the obtained correlation
is substituted in the formula 3r/(1 + 2r) to obtain the reliability;
and correspondingly, for greater numbers of subtests, equation 10,
Chapter 8, should be used with K set equal to the number of subtest
scores.
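
The three split-half routes summarized above can be compared on a small set of scores. The sketch below is illustrative: all function names and the half-test scores are hypothetical, variances are population variances (divisor N), and `step_up` is assumed to be the general form Kr/(1 + (K - 1)r), which reduces to 2r/(1 + r) for halves and 3r/(1 + 2r) for thirds.

```python
from statistics import mean, pvariance


def pearson_r(xs, ys):
    """Product-moment correlation using population (divisor N) moments."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pvariance(xs) * pvariance(ys)) ** 0.5


def step_up(r, k=2):
    """Reliability of a test made of k parallel parts that intercorrelate r."""
    return k * r / (1 + (k - 1) * r)


def reliability_from_difference(half1, half2):
    """One minus the ratio of difference-score variance to total-score variance."""
    total = [a + b for a, b in zip(half1, half2)]
    diff = [a - b for a, b in zip(half1, half2)]
    return 1 - pvariance(diff) / pvariance(total)


def reliability_from_half_variances(half1, half2):
    """Twice (one minus the sum of half variances over total-score variance)."""
    total = [a + b for a, b in zip(half1, half2)]
    return 2 * (1 - (pvariance(half1) + pvariance(half2)) / pvariance(total))


# Hypothetical half-test scores for six subjects.
half1 = [12, 15, 9, 20, 17, 11]
half2 = [10, 16, 8, 18, 19, 9]
print(step_up(pearson_r(half1, half2)))
print(reliability_from_difference(half1, half2))
print(reliability_from_half_variances(half1, half2))
```

The last two printed values agree apart from rounding, and the stepped-up correlation is at least as large as either whenever the halves correlate positively, with equality only when the two halves have equal variance.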

The last section presents some special considerations related to the 
reliability of essay examinations. In addition to the usual sources of 
error in objective examinations, inaccuracy of reading contributes to 
lowering the reliability of essay examinations. The correlation between 
scores assigned to the same set of papers by two different readers is 
known as reader reliability. The agreement in means and variances, as 
well as in correlations, can also be assessed by the methods of Chap¬ 
ter 14, provided the examinations are read by three or more readers. 

The correlation between two parallel forms of an essay examination,
when corrected for the attenuation due to inaccuracy of reading, was
called the content reliability ($r_{x'_c x''_c}$) of the essay examination. It is
given by

$$r_{x'_c x''_c} = \frac{r_{x'_1 x''_2}}{\sqrt{r_{x'_1 x'_2} \, r_{x''_1 x''_2}}}, \tag{24}$$

where $r_{x'_1 x''_2}$ is the correlation between assigned scores on the parallel
forms,
$r_{x'_1 x'_2}$ is the reader reliability for the x' scores, and
$r_{x''_1 x''_2}$ is the reader reliability for the x'' scores.

Problems

1. If the correlation between the first and second halves of a test is .70, what is the 
reliability of the test? 

2. If the odd-even correlation for a test is .83, what is the reliability of the test? 

3. If the reliability of a test is .92, what should be the correlation between two 
parallel halves? 

4. If the reliability of a test is .97, what should be the correlation: 

(a) Between two parallel thirds? 

(b) Between two parallel fourths?

5. If the standard deviation of the total scores is 45.3 and the standard deviation 
of the “odds minus evens” score is 12.1, what is the test reliability on the assumption 
that odds and evens are parallel subtests? 

6. If x and y are parallel tests, the variance of x — y is 73.28, and the variance of 
x + y is 841.56. 

(a) What is the reliability of the x + y score?

(b) What is the reliability of the x score?

(c) What is the reliability of the y score? 

7. Consider each of the following tests that are to be administered to a group of 
high school seniors. For each test, three methods of estimating reliability are being 
considered: (a) the parallel form method, (b) the odd-even method, and (c) the first-
second halves method. For each test indicate whether each of the methods is suit¬ 
able or not, and explain why. 

Test A. Two-digit additions, 100 items, 2-minute time limit.

Test B. Current events information questions, 50 items, 25-minute time limit.
Each item answered correctly by 65 to 75 per cent of students. They are not arranged
in order of difficulty, but in a random order.

Test C. A series of 30 mathematical reasoning problems, ranging in difficulty from 
some that are answered correctly by 90 per cent of the students to others answered
correctly by 20 per cent. The problems are arranged in order of difficulty; one hour 
is allowed for the test. 

Test D. A test of differential brightness acuity. Two lights are flashed simultaneously,
and the task is to indicate which is brighter. The items have a large
difficulty range and are presented more or less in order of increasing difficulty. The 
test contains 50 items presented at the rate of one every 5 seconds. 

Test E. A shorthand dictation test, a 1600-word passage read at the rate of 80 
words per minute. 

8. An examination has an objective section (o) and an essay section (e). The
split-half correlation of scores for section o is .80. The corresponding correlation for 
section e is .60. On section e the correlation between the total score given by reader 
A and reader B is .78. Estimate the content reliability: 

(a) For section o?

(b) For section e?

9. The following data on 52 students taking the composition section of the French
104-5-6 in June 1940 at the University of Chicago were made available by Dr. 
Lawrence Andrus. 


  I    II   III   IV        I    II   III   IV

  1    41    24   17       27    58    32   26
  2    40    22   18       28    35    24   11
  3    73    40   33       29    55    31   24
  4    39    20   19       30    62    32   30
  5    74    37   37       31    68    32   36
  6    49    31   18       32    55    30   25
  7    35    20   15       33    62    29   33
  8    59    33   26       34    67    36   31
  9    44    28   16       35    53    30   23
 10    51    25   26       36    54    29   25
 11    55    26   29       37    61    32   29
 12    54    31   23       38    68    31   37
 13    36    25   11       39    58    30   28
 14    74    35   39       40    60    29   31
 15    48    29   19       41    84    43   41
 16    52    28   24       42    39    20   19
 17    66    42   24       43    37    22   15
 18    73    39   34       44    56    28   28
 19    59    33   26       45    56    31   25
 20    50    26   24       46    25    15   10
 21    25    18    7       47    72    36   36
 22    60    31   29       48    33    24    9
 23    60    34   26       49    41    26   15
 24    65    34   31       50    66    35   31
 25    41    18   23       51    38    27   11
 26    65    35   30       52    84    41   43


Column I gives the code number of each student. 

Column II gives the total score for each student. 

Column III gives the score on the first fifty items for each student. 

Column IV gives the score on the second fifty items for each student. 

Number of items = 100
Maximum number of score points = 100
Mean raw score = 54.52
Standard deviation = 13.98
Number of students = 52

From the foregoing data calculate: 

(a) The reliability coefficient for the total test by the split-half method using the 
Spearman-Brown correction.

(b) The reliability coefficient of the total test from the variance of the total score,
and the variance of the difference between scores on the two halves. 

(c) The reliability coefficient of the total test from the variance of the total score, 
the variance of score on the first fifty items, and the variance of scores on the 
second fifty items. 

(d) The standard error of measurement for this test. 

(e) What are reasonable limits for the true score of a person who scores 51 on the 
total test? One who scores 73 on the total test? 

(f) Estimate the reliability for a comparable test, twice as long as this one, four
times as long, seven times as long, ten times as long. (See table for Spearman-Brown
formula in Dunlap and Kurtz, 1932.)

(g) If one wished this test to have a reliability of .97, how long would it be necessary
to make it?

(h) Graph the results of problems f and g.

(i) Estimate the reliability this test would have if it were applied to a group whose 
scores had a standard deviation half that of the original group. 

(j) Estimate the reliability this test would have if it were applied to a group whose 
scores had a standard deviation twice that of the original group. 



16 

Reliability Estimated 
from Item Homogeneity 


1. Introduction 

As previously indicated, the original approach to the problem of 
reliability by Spearman was based on the correlation of parallel tests. 
Kelley (1942) has pointed out that, according to this concept, the major 
function of the reliability coefficient is to evaluate the judgment of the 
test constructor, to indicate whether or not two forms thought to meas¬ 
ure the same thing do in fact measure approximately the same thing. 
Recently there have been several other approaches designed to measure 
the homogeneity of the items in a test. It should be noted that, if two 
tests have each a high “homogeneity index” while the correlation be¬ 
tween them is low, we have a distinctly disturbing situation. The indi¬ 
cation would perhaps be that a homogeneous field existed but that the 
test constructor did not know enough about that field to construct two 
parallel tests, clearly an unsatisfactory situation. Likewise, suppose 
the “homogeneity index” is very low, but the test constructor is able to 
set up a different form, a parallel test that correlates highly with the 
first form. Here it would seem that the situation is satisfactory. The 
field is not unitary, but the test constructor knows the field well enough 
to set up different tests and have them agree. In short, if a parallel 
form reliability is high, the situation is satisfactory; if the parallel form 
reliability is low, the situation is unsatisfactory, regardless of what hap¬ 
pens to the index of homogeneity. 

One approach to the problem of item homogeneity is to make a 
factor analysis of the inter-item correlations for a test. If there is only 
one common factor, the items are homogeneous. If the analysis reveals 
more than one common factor, it might be desirable to consider dividing 
the test into parts, each of which represented a single common factor. 
Such a method would be extremely laborious for any very long test. 
Carroll (1945) has shown that the point biserial correlation cannot give
a one-factor solution if the items differ in difficulty. He has suggested
that the two-by-two scatter plot for each pair of items be corrected for 
the effect of guessing and that the tetrachoric correlation coefficients 
then be used for a factor analysis. Such a suggestion will probably not 
be widely adopted until very much more rapid methods of obtaining 
correlations and factor analyses are available. 

The use of methods of analysis of variance for the solution of the 
problem of reliability has been suggested by many writers; see Johnson 
and Neyman (1936), Jackson (1939), (1940b), and (1942), Hoyt (1941),
and Alexander (1947). Jackson and Ferguson (1941) show how the 
analysis of variance can aid in separating out and assessing the various 
sources of unreliability in a test. The consideration of such methods 
and such a detailed analysis of sources of unreliability are beyond the 
scope of an elementary discussion. 

It has also been shown that the homogeneity of a set of items may be 
assessed by comparing the standard deviation of the test with the standard
deviation to be expected from items of the same difficulty that are
correlated zero, or correlated perfectly with each other. This approach 
has been developed by Loevinger (1947). 

Kuder and Richardson (1937) developed several methods of assessing 
the homogeneity of a set of items without the use of a parallel test. 
Further studies of these methods have been presented by Richardson 
and Kuder (1939), Dressel (1940), and Kaitz (1945a). The use of the 
Lexian ratio in measuring reliability was suggested by Edgerton and 
Thomson (1942). 

Guttman (1945) and (1946) has presented a theory of reliability in 
terms of estimation of lower bounds for reliability. His view is that the 
upper bound for reliability of any test is always unity, and that fre¬ 
quently a lower bound can be determined that is far enough from zero 
and near enough to unity to be of use. He has presented a number of 
different lower bound estimates for both quantitative and qualitative 
data. 

In this chapter, we shall present only two of these alternative meth¬ 
ods of estimating reliability by means of an index of item homogeneity 
which does not require the division of a test into parallel subtests. 
Both methods require the number of test items (K) and the standard 
deviation of the test ($s_x$). One method requires, in addition, the average
item variance; the other requires the test mean. 

2. Reliability estimated from item difficulty and test variance 

If we assume two tests that are parallel item for item, the intercorrelation
(reliability) of these tests may be written as follows. Let us use x
for one test and y for the other test. The items are designated by subscripts
from 1 to K.

$$r_{\Sigma x \Sigma y} = \frac{\sum_{i=1}^{N} (x_{1i} + x_{2i} + \cdots + x_{Ki})(y_{1i} + y_{2i} + \cdots + y_{Ki})}{\sqrt{\sum_{i=1}^{N} (x_{1i} + x_{2i} + \cdots + x_{Ki})^2 \, \sum_{i=1}^{N} (y_{1i} + y_{2i} + \cdots + y_{Ki})^2}}. \tag{1}$$

Expanding and collecting terms in the numerator, and noting that
since we are dealing with parallel tests the two factors in the denominator
are equal, we have

$$r_{\Sigma x \Sigma y} = \frac{\sum_{g=1}^{K} \sum_{i=1}^{N} x_{gi} y_{gi} + \sum_{i=1}^{N} \sum_{g=1}^{K} \sum_{h=1}^{K} x_{gi} y_{hi}}{\sum_{i=1}^{N} \sum_{g=1}^{K} x_{gi}^2 + \sum_{i=1}^{N} \sum_{g=1}^{K} \sum_{h=1}^{K} x_{gi} x_{hi}} \quad (g \neq h). \tag{2}$$

Dividing through by N and writing the results in terms of correlation
and standard deviations, we have

$$r_{\Sigma x \Sigma y} = \frac{\sum_{g=1}^{K} r_{x_g y_g} s_{x_g} s_{y_g} + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{x_g y_h} s_{x_g} s_{y_h}}{\sum_{g=1}^{K} s_{x_g}^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{x_g x_h} s_{x_g} s_{x_h}} \quad (g \neq h). \tag{3}$$

Since $s_{x_g}$ and $s_{y_g}$ are standard deviations of parallel items in two forms
of the test, we may assume that they are equal. Likewise, since $r_{x_g y_g}$ is
the correlation between two parallel items, it is a reliability coefficient,
and may be written $r_{gg}$. In general, we may now drop the distinction
between x and y and retain only the subscripts that denote whether we
have the same or a different item. This change gives

$$r_{\Sigma x \Sigma y} = \frac{\sum_{g=1}^{K} r_{gg} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h}{\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h} \quad (g \neq h). \tag{4}$$


It will be noted that the denominator is the variance of the total test.
Since the numerator and denominator are alike except for the first
term, we may designate the total test variance by $s_x^2$ and write

$$r_{\Sigma x \Sigma x} = \frac{s_x^2 - \sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} r_{gg} s_g^2}{s_x^2}. \tag{5}$$






Since we do not actually have two tests that parallel each other item
for item, it is necessary to make some assumption in order to have a
value for $r_{gg}$. The simplest and most direct assumption is that the
average $r_{gg} s_g^2$, which is the covariance between parallel items, is equal
to the average ($r_{gh} s_g s_h$), which is the covariance between non-parallel
items. That is,

$$\sum_{g=1}^{K} r_{gg} s_g^2 = \frac{\sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h}{K - 1} \quad (g \neq h). \tag{6}$$

Since

$$s_x^2 = \sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} r_{gh} s_g s_h \quad (g \neq h), \tag{7}$$

we may write

$$\sum_{g=1}^{K} r_{gg} s_g^2 = \frac{s_x^2 - \sum_{g=1}^{K} s_g^2}{K - 1}. \tag{8}$$

Substituting this value in equation 5 gives

$$r_{\Sigma x \Sigma x} = \frac{s_x^2 - \sum_{g=1}^{K} s_g^2 + \dfrac{s_x^2 - \sum_{g=1}^{K} s_g^2}{K - 1}}{s_x^2}. \tag{9}$$

We may write $r_{xx}$ for the reliability of the test and simplify equation 9 to

$$r_{xx} = \frac{K}{K - 1} \left( 1 - \frac{\sum_{g=1}^{K} s_g^2}{s_x^2} \right), \tag{10}$$


where $r_{xx}$ is the reliability coefficient of the test,
K is the number of items in the test,
$s_g^2$ is the variance of item g (equals $p_g(1 - p_g)$, where $p_g$ is the
proportion getting the item correct), and
$s_x^2$ is the test variance.

It should again be noted that the only assumption made in deriving
this equation was that the average covariance among non-parallel items 
was equal to the average covariance among parallel items. 

In terms of item difficulties or percentage passing a given item, we
may write

$$r_{xx} = \frac{K}{K-1}\left(1 - \frac{\sum_{g=1}^{K} (p_g - p_g^2)}{s_x^2}\right), \tag{11}$$

where $p_g$ is the proportion passing a given item, and all other terms are
defined as in the preceding equation.


If the test variance, the number of items in a test, and the
percentage of persons correctly answering each item are known,
and if the test score is the number of items answered correctly,
a lower bound for the reliability coefficient of the test
is given by equation 10 or equation 11. These equations are
based on only one assumption, that the average covariance
between non-parallel items is equal to the average covariance
between parallel items.
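
The computation in equation 11 can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
`kr20` and the sample values are hypothetical, and the item difficulties
are assumed to be given as proportions (0 to 1), not percentages.

```python
def kr20(p, test_variance):
    """Equation 11: reliability from item proportions and test variance.

    p             -- list of proportions passing each item (assumed 0..1)
    test_variance -- variance of the number-correct scores, s_x squared
    """
    k = len(p)
    if k < 2:
        raise ValueError("at least two items are needed")
    # Sum of item variances: each item variance is p_g - p_g^2.
    sum_item_var = sum(pg - pg ** 2 for pg in p)
    return (k / (k - 1)) * (1 - sum_item_var / test_variance)


# Hypothetical four-item test: every item passed by half the group,
# with a test variance of 2.0.
example = kr20([0.5, 0.5, 0.5, 0.5], 2.0)
print(round(example, 4))
```

If the difficulties are recorded as percentages, they would need to be
divided by 100 before being passed to this sketch.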


It should be noted that formulas 10 and 11 are identical with “formula
20,” derived by Kuder and Richardson (1937), with formula 29,
in Chapter V of Jackson and Ferguson (1941), and the formula for $L_3$
given by Guttman (1945). However, the assumptions used for the
derivation were radically different in these three papers. Kuder and
Richardson assumed that all inter-item correlations were equal. Jackson
and Ferguson, however, showed that it is necessary only to assume
that the average covariance between parallel items is equal to the aver¬ 
age covariance between non-parallel items. They also showed that the 
assumptions made by Kuder and Richardson (1937) were not only 
unnecessarily restrictive, but were in some cases internally inconsistent. 
Guttman demonstrated that the value given by equation 10 is a lower 
bound to the reliability coefficient. 

If the item-test correlations or the inter-item correlations are known, 
it is possible to use this information in more complex formulas to obtain 
better estimates of the test reliability. Such formulas have been given 
by Kuder and Richardson (1937) and by Guttman (1945). These 
formulas are not given here since it seems that formula 10 is usually 
quite satisfactory. As a result of some empirical studies, Richardson 
and Kuder (1939) recommended their “formula 20” as the best one 
to use.

3. Reliability estimated from test mean and variance

Let us consider the simplified Kuder-Richardson formulation that is
obtained by assuming that all items are of the same difficulty. In this
case it is possible to estimate $\sum s_g^2$ or $\sum pq$ from the mean of the test.
The number of subjects getting item $g$ correct is $Np_g$. The sum of these
terms over all the items is the total number of correct answers,
$N\sum_{g=1}^{K} p_g$. Since the total number of correct answers divided by the
number of subjects is the mean of the test, we have

$$M_x = \sum_{g=1}^{K} p_g, \tag{12}$$

or using $\bar{p}$ to designate the average item difficulty, we may write

$$M_x = K\bar{p}. \tag{13}$$

Likewise the sum of the variances ($\sum s_g^2$) may be written

$$\sum p - \sum p^2 = K\bar{p} - K\overline{p^2}. \tag{14}$$

If all items are the same difficulty, the average of the squares will be
equal to the square of the average, and we may write

$$\sum p - \sum p^2 = M_x - \frac{M_x^2}{K}. \tag{15}$$

If we substitute this equation in the numerator of equation 11, we have 


$$r_{xx} = \frac{K}{K-1}\left(1 - \frac{M_x - \dfrac{M_x^2}{K}}{s_x^2}\right), \tag{16}$$

where $r_{xx}$ is the reliability of the test,

$K$ is the number of items in the test,

$M_x$ is the test mean, and

$s_x^2$ is the variance of raw scores on the test.

This formula is identical with the Kuder-Richardson “formula 21.”
The derivation given here uses the same assumption as equation 10 or
equation 11 plus the assumption that all item difficulties are equal.
Formula 16 has the advantage of being very simple to calculate, since
it uses only the mean, variance, and number of items. Also it has the
advantage of being a lower bound so that we can by the use of this
formula quickly satisfy ourselves that a given test is performing fairly
well. This formula gives an exact figure for the reliability if all items
are of the same difficulty level. If the items in an examination have a
wide difficulty range, formula 16 gives an unsatisfactorily low figure
for the reliability.

If only the mean, standard deviation, and number of items
in a test are known, and if the test score is the number of
items answered correctly, a lower bound for the reliability
coefficient of the test is given by formula 16. If all the test items
are of equal difficulty, this value will be identical with that
given by formulas 10 and 11; otherwise it will be smaller.
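
The relation between formula 16 and formula 11 can be illustrated with a
short Python sketch. The function names `kr20` and `kr21` and the item
proportions below are hypothetical, chosen only to show that the two
formulas agree when all items are equally difficult and that formula 16
is smaller otherwise.

```python
def kr20(p, test_variance):
    """Equation 11: uses the individual item proportions p_g (0..1)."""
    k = len(p)
    sum_item_var = sum(pg - pg ** 2 for pg in p)
    return (k / (k - 1)) * (1 - sum_item_var / test_variance)


def kr21(k, mean, test_variance):
    """Equation 16: uses only the number of items, test mean, and variance."""
    return (k / (k - 1)) * (1 - (mean - mean ** 2 / k) / test_variance)


# Hypothetical tests of four items, each with mean 2.0 and variance 2.0.
equal_items = [0.5, 0.5, 0.5, 0.5]      # all items equally difficult
unequal_items = [0.2, 0.4, 0.6, 0.8]    # wide range of difficulty

same = kr20(equal_items, 2.0) - kr21(4, sum(equal_items), 2.0)
gap = kr20(unequal_items, 2.0) - kr21(4, sum(unequal_items), 2.0)
print(abs(same) < 1e-12, gap > 0)
```

The test mean is obtained here from equation 12 as the sum of the item
proportions, which holds only when the score is the number of items
answered correctly.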


4. Summary 

If score on a test is the number of items correctly answered, and if
we know the number of items in the test, the test variance, and the
proportion of persons answering each item correctly, the test reliability
may be calculated by

$$r_{xx} = \frac{K}{K-1}\left(1 - \frac{\sum_{g=1}^{K} s_g^2}{s_x^2}\right), \tag{10}$$

where $r_{xx}$ is the reliability coefficient of the test,

$K$ is the number of items in the test,

$s_x^2$ is the test variance, and

$s_g^2$ is the variance of item $g$, which equals $p_g(1 - p_g)$, where $p_g$
is the proportion of persons answering the item correctly.

Substituting for $s_g^2$ its value in terms of $p_g$, we have

$$r_{xx} = \frac{K}{K-1}\left(1 - \frac{\sum_{g=1}^{K} (p_g - p_g^2)}{s_x^2}\right). \tag{11}$$

These formulas are based on the assumption that the average covariance
between non-parallel items is equal to the average covariance between
parallel items. Since, in general, the former is smaller than the latter,
the values given by equations 10 and 11 will, in general, be underestimates
of the reliability.

Using only the test mean, variance, and number of items, we may
estimate the test reliability by

$$r_{xx} = \frac{K}{K-1}\left(1 - \frac{M_x - \dfrac{M_x^2}{K}}{s_x^2}\right), \tag{16}$$

where $M_x$ is the test mean and the other terms have the same definitions
as in equation 10. If all the test items are of equal difficulty, equation 16
will be identical with equations 10 and 11. Usually equation 16 gives
values that are considerably less than the values given by equations 10
and 11. Like equations 10 and 11, equation 16 may be used only when
the score on a test is a linear function of the number of items answered
correctly.


Problems 


1. We have the following information on a test. Find the reliability by
using formulas 11 and 16.

Item     p
   1    70
   2    90
   3    88
   4    94
   5    77
   6    86
   7    69
   8    85
   9    46
  10    77
  11    74
  12    60
  13    30
  14    50
  15    85
  16    90
  17    35
  18    25
  19    47
  20    91
  21    27
  22    23
  23    34
  24    32
  25    65

Sum of p/100 = 15.50

M = 15.5
s = 5.6


p is the percentage of persons answering each item correctly. (Ample time was 
allowed for this test so that all 500 persons attempted each item.) The score was 
number of items answered correctly.

2. Use the following information to obtain the test reliability by formulas 11 and 16.

Item     p    N_a
   1    73    500
   2    68    500
   3    90    500
   4    91    500
   5    70    500
   6    80    500
   7    77    500
   8    39    500
   9    61    500
  10    72    500
  11    71    500
  12    66    500
  13    37    500
  14    50    500
  15    85    500
  16    49    500
  17    70    496
  18    65    495
  19    57    490
  20    54    488
  21    16    475
  22    15    465
  23    15    455
  24    20    450
  25    23    440
  26    35    420
  27    34    410
  28    30    400
  29    28    400
  30    33    390

M_x = 15.3
s_x = 5.1

Use the item analysis data to determine the reliability of the test. (Note that p is
not percentage of total group answering item correctly.)

Score is number of items answered correctly.

N_a is number of persons attempting each item.
p is percentage of persons attempting the item who answer it correctly.

3. The following data on 52 students taking the composition section of the French
104-5-6 in June 1940 were made available by Dr. Lawrence Andrus of the University
of Chicago.

Column A gives the item number.

Column B gives the proportion of entire group passing.

  A     B      A     B      A     B      A     B
  1    .13    26    .42    51    .04    76    .56
  2    .54    27    .56    52    .69    77    .25
  3    .42    28    .83    53    .44    78    .38
  4    .56    29    .50    54    .79    79    .25
  5    .73    30    .79    55    .25    80    .44
  6    .27    31    .69    56    .19    81    .75
  7    .65    32    .27    57    .52    82    .29
  8   1.00    33    .52    58    .52    83    .33
  9    .25    34    .62    59    .19    84    .37
 10    .21    35    .75    60    .77    85    .71
 11    .54    36    .81    61    .60    86    .62
 12    .38    37    .75    62    .54    87    .69
 13    .60    38    .44    63    .58    88    .25
 14    .60    39    .44    64    .92    89    .67
 15    .77    40   1.00    65    .56    90    .58
 16    .62    41    .98    66    .42    91    .08
 17    .60    42    .71    67    .38    92    .90
 18    .62    43    .67    68    .35    93    .52
 19    .88    44    .63    69    .44    94    .81
 20    .42    45    .79    70    .27    95    .83
 21    .83    46    .77    71    .56    96    .81
 22    .62    47    .58    72    .52    97    .44
 23    .60    48    .63    73    .56    98    .08
 24    .77    49    .17    74    .23    99    .79
 25    .63    50    .06    75    .65   100    .56

N = 52. M = 54.52. s = 13.98.

From the foregoing data:

(a) Estimate the reliability of the test, from the test variance and average item
variance.

(b) Estimate the reliability from the test mean and variance.

(c) Compare these values with those found for the same set of test papers in
problem 9, Chapter 15.



17 

Speed versus Power Tests 


1. Definition of speed and power tests 

In this chapter the problem of distinguishing between speed and power
tests will be considered, and a criterion will be proposed for determining
the extent to which a given test approaches a “pure speed” or a “pure
power” test. This material is presented as a suggestion toward a
differential rationale for speed and power tests. Relatively little has been
written on this subject, despite the fact that the problems of item
analysis, test length, item difficulty distribution, determination of
reliability, and error of measurement are all quite different for the two
types of tests. At present most tests are a composite in unknown
proportions of speed and power, which makes the development of appropriate
theorems in test theory more difficult than for the pure type tests.

First let us define what is meant by a pure speed and a pure power
test. A pure speed test is a test composed of items so easy that the
subjects never give the wrong answer to any of them. The answers are
correct as far as the subject has gone in the test. However, the test
contains so many items that no one finishes it in the time allowed. The
subject’s score, therefore, depends entirely on how far he is able to go
in the time allowed. (We shall assume here that the subjects are
instructed not to skip any of the items, and that they follow that
instruction.)

In order to discuss the speed-power problem symbolically we shall 
distinguish between two types of “errors.” We let 

W designate the number of items for which the subject gives an 
incorrect answer, 

U designate the number of items that the subject does not reach, and 

X designate the total error score on the test. 

That is, X = W + U. 

In a “pure speed” test W will be zero for each subject; hence both
the mean and the standard deviation of W will equal zero. Also X = U,
that is, the subject’s entire score is determined by the number of items
that he does not attempt; hence the mean of X equals the mean of U,
and the variance of X equals the variance of U.

These are the characteristics of a pure speed test. Any actual test
then may be said to approach being a pure speed test to the extent that
$M_W$, the mean, and $s_w$, the standard deviation of the W’s, approach
zero, and the mean and the standard deviation of the U’s approach the
mean and standard deviation of the total number of errors (W + U).

In a “pure power” test all the items are attempted so that the score
on the test depends entirely upon the number of items that are answered,
and answered incorrectly. (Again we assume that by careful directions
none of the items is skipped.) In the pure power test, U will be zero
for each person; hence the mean and standard deviation of U will be
zero. Since for each subject X = W, the mean and standard deviation
of X equal the mean and standard deviation of W. Again we should
note that these characteristics hold strictly only for the pure power test.
To the extent that these conditions are approximated, the test
approaches a power test.

As has already been pointed out, the split-half (especially the odd-even)
reliability cannot be used for any test except a pure power test.
As the speed factor enters more and more into the determination of
test score, the higher the odd-even reliability will become. Let us now
consider a criterion that will indicate when a test is sufficiently close to
a pure power test so that we may be relatively certain that the odd-even
reliability or some other split-half reliability will not be spuriously high
or low. Likewise a criterion for a pure speed test should indicate when
a test is primarily a speed test so that the variability due to item
difficulty or to carelessness in answering items is negligible. Depending on
whether speed and power are positively or negatively correlated, the
test-retest reliability of a test that involves both elements is likely to
be higher or lower than the reliability of a test that involves only one
element. Therefore, if we wish to measure speed in a given function it
is important to make certain that we are dealing only to a negligible
extent with a test involving power.

2. Effect of unattempted items (or wrong items) on the standard deviation

First let us consider the problem of determining whether the standard
deviation of a test is influenced mainly by the speed or the power factor
in the test. As in previous derivations, $M_X = M_W + M_U$ so that we
may designate the deviation scores by lower-case letters, and write

$$x = w + u.$$

Taking the standard deviation, we square, sum, and divide by $N$,
obtaining

$$\frac{\sum x^2}{N} = \frac{\sum (w + u)^2}{N}. \tag{1}$$

Expanding, we have

$$\frac{\sum x^2}{N} = \frac{\sum w^2}{N} + \frac{\sum u^2}{N} + \frac{2\sum wu}{N}. \tag{2}$$

This may also be written

$$s_x^2 = s_w^2 + s_u^2 + 2 r_{wu} s_w s_u. \tag{3}$$

In a pure power test, all subjects will finish and the variance of $u$ will be
zero; hence the last two terms will be zero. In a pure speed test there
will be no errors made by one who attempts an item; hence the first
and the last terms of the right-hand expression will be zero.

In a pure speed test, $s_w = 0$ and $s_u = s_x$. In a pure power test,
$s_u = 0$ and $s_w = s_x$.

It should be noted that $r_{wu}$ may well be negative. The subject who
omits the fewest items will have answered the most items. Therefore,
he may well have a great many errors, thus tending to make the subject
with many actual errors ($w$) the one with the fewest unattempted items
($u$). For this reason, if we do not wish to calculate both $s_u$ and $s_w$, it is
necessary to rely on the one likely to be zero, or near zero.

For example, if $r_{wu}$ is $-1$, $s_x = s_w - s_u$ or else $s_x = s_u - s_w$. In
either case it is possible that both $s_w$ and $s_u$ would be larger than $s_x$,
thus making the use of either one alone unsuitable as an indication of
the magnitude of the other variance.

On the other hand, if either $s_w$ or $s_u$ is zero, or very nearly zero, the
other component must be very nearly equal to $s_x$. The two extreme cases
occur when $r_{wu} = +1$ or $-1$. In the former case $s_x = s_w + s_u$; in the
latter $|s_x| = |s_w - s_u|$. If $s_u/s_x = 0.1$, $s_w/s_x$ must lie between 0.9
and 1.1. If $s_u/s_x = 0.01$, the ratio $s_w/s_x$ cannot be less than 0.99 nor
more than 1.01. In such a case we have a test that is primarily a power
test, in the sense that the test variance would not be changed much if
the subjects were allowed to finish the test. At one extreme possibility,
if they were allowed to finish, they would all get all the unfinished items
wrong, in which event the new $s_x$ would equal the old one. At the other
extreme no one would get any of the items wrong, in which case the new
$s_x$ would be equal to the present $s_w$, which, as we have seen above, must
be within 10 per cent of $s_x$ if the ratio $s_u/s_x = 0.1$.

Thus from the viewpoint of effect upon the standard deviation of a
test, we may say that a test is essentially a speed test if $s_w/s_x$ is very
small; and that a test is essentially a power test if $s_u/s_x$ is very small.

For a speed test $s_w/s_x$ is small and

$$1 + \frac{s_w}{s_x} > \frac{s_u}{s_x} > 1 - \frac{s_w}{s_x}. \tag{4}$$

A lower bound for the standard deviation is indicated by

$$s_u' = s_x - s_w; \tag{5}$$

an upper bound is indicated by

$$s_u'' = s_x + s_w. \tag{6}$$

For a power test $s_u/s_x$ is small and

$$1 + \frac{s_u}{s_x} > \frac{s_w}{s_x} > 1 - \frac{s_u}{s_x}. \tag{7}$$

From which we have a lower bound for the standard deviation
indicated by

$$s_w' = s_x - s_u; \tag{8}$$

and an upper bound indicated by

$$s_w'' = s_x + s_u. \tag{9}$$

It should be noted that, although statements identical to the foregoing
ones can be made for a large ratio, they are in that case not very helpful.
For example, if $s_u/s_x = 0.75$, then

$$1 + 0.75 > \frac{s_w}{s_x} > 1 - 0.75.$$

In other words, $s_w/s_x$ may be as small as 0.25, which is one-third of the
ratio $s_u/s_x$, or it may be equal to 1.75, which is more than double the
ratio $s_u/s_x$.
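
The decomposition in equation 3 and the bounds in equations 8 and 9 can be
checked numerically. The following Python sketch uses made-up scores for
six subjects; the helper names `pop_sd` and `pop_corr` and the data are
hypothetical, and the standard deviations use the divisor $N$ as in the
derivation above.

```python
from math import sqrt


def pop_sd(values):
    """Standard deviation with divisor N, as in the text."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / n)


def pop_corr(a, b):
    """Product-moment correlation using divisor-N moments."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (pop_sd(a) * pop_sd(b))


# Hypothetical scores (illustration only): wrongs W and unattempted U.
W = [3, 5, 2, 6, 4, 7]
U = [4, 1, 5, 0, 3, 1]
X = [w + u for w, u in zip(W, U)]   # total error score X = W + U

s_w, s_u, s_x = pop_sd(W), pop_sd(U), pop_sd(X)
r_wu = pop_corr(W, U)

# Equation 3: variance of X rebuilt from its components.
var_from_eq3 = s_w ** 2 + s_u ** 2 + 2 * r_wu * s_w * s_u

# Equations 8 and 9: bounds on s_w obtained from s_x and s_u alone.
lower, upper = s_x - s_u, s_x + s_u
print(abs(s_x ** 2 - var_from_eq3) < 1e-9, lower <= s_w <= upper)
```

In these made-up data the subjects with the most wrong answers leave the
fewest items unattempted, so $r_{wu}$ is negative, which is the situation the
text warns about.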

3. Effect of unattempted items on the error of measurement

The error of measurement for the total score $x$ is equal to the standard
deviation times the quantity $\sqrt{1 - r}$. Since we have already
considered the standard deviation let us consider the other quantity,
$\sqrt{1 - r}$. Again we define the reliability as the correlation between
two parallel forms, designated 1 and 2:

$$r_{x_1 x_2} = \frac{\sum x_1 x_2}{N s_{x_1} s_{x_2}}. \tag{10}$$

Let us first write out the numerator in terms of the component scores
$w$ and $u$:

$$\sum x_1 x_2 = \sum (w_1 + u_1)(w_2 + u_2). \tag{11}$$

Expanding, we have

$$\sum x_1 x_2 = \sum w_1 w_2 + \sum u_1 u_2 + \sum w_2 u_1 + \sum w_1 u_2. \tag{12}$$

Using reliabilities and intercorrelations, and noting that variances of
parallel forms are equal, we have

$$\sum x_1 x_2 = N r_{w_1 w_2} s_w^2 + N r_{u_1 u_2} s_u^2 + 2N r_{wu} s_w s_u. \tag{13}$$

Substituting equation 13 in equation 10, and setting the two variances
in the denominator equal to each other, we have

$$r_{x_1 x_2} = \frac{r_{w_1 w_2} s_w^2 + r_{u_1 u_2} s_u^2 + 2 r_{wu} s_w s_u}{s_x^2}. \tag{14}$$

Substituting equation 3 in equation 14, we have

$$r_{x_1 x_2} = \frac{r_{w_1 w_2} s_w^2 + r_{u_1 u_2} s_u^2 + 2 r_{wu} s_w s_u}{s_w^2 + s_u^2 + 2 r_{wu} s_w s_u}. \tag{15}$$

Using equation 15, we may write

$$1 - r_{x_1 x_2} = \frac{s_w^2 (1 - r_{w_1 w_2}) + s_u^2 (1 - r_{u_1 u_2})}{s_w^2 + s_u^2 + 2 r_{wu} s_w s_u}. \tag{16}$$

From equation 3 we see that the denominator of equation 16 is $s_x^2$.
Making this substitution gives

$$s_x^2 (1 - r_{xx}) = s_w^2 (1 - r_{ww}) + s_u^2 (1 - r_{uu}), \tag{17}$$

where $s_w^2$ is the variance of the $w$-score (number of items answered
incorrectly),

$s_u^2$ is the variance of the $u$-score (number of items unattempted
at the end of the test),

$s_x^2$ is the variance of $x$, which equals $w + u$,

$r_{ww}$, $r_{uu}$, and $r_{xx}$ represent the reliabilities of these scores. The
formula is correct for either split-half or alternate form
estimates of reliability.





If $x$ is defined as $w + u$, the error variance for the $x$-score
is equal to the error variance of the $w$-score plus the error
variance of the $u$-score.

It should be noted that in any test that has both the $w$ and $u$
components, the split-half reliability of $u$ is unity; hence the last term of
equation 17 is zero for any split-half reliability. A valid estimate of
this second term is given only by a test-retest reliability. For a pure
power test, the variance of $u$ is zero; hence a stepped-up split-half
correlation is a valid estimate of its reliability. If a test is primarily a
power test, that is, if the variance of $u$ is negligible, the stepped-up
split-half correlation is still a reasonable estimate of the test reliability.
However, when a test is partly speed as well as power so that the second
term of equation 17 is not negligible, or when a test is primarily a speed
test so that this second term is the major component of the error of
measurement, the error of measurement obtained from a split-half
reliability is too low. In such a case a test-retest or a parallel form
reliability must be used. Whenever the standard deviation of $u$ is
much greater than two or three tenths of the standard deviation of $w$,
a split-half correlation is an unsafe basis for estimating the test reliability.
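
The effect described above can be made concrete with equation 17. In the
Python sketch below the function name `error_variance` and all numerical
values are hypothetical; it simply contrasts the error variance obtained
when $r_{uu}$ is forced to unity (the split-half case) with the value obtained
for an assumed test-retest $r_{uu}$.

```python
def error_variance(s_w, r_ww, s_u, r_uu):
    """Equation 17: s_x^2 (1 - r_xx) as the sum of two error variances."""
    return s_w ** 2 * (1 - r_ww) + s_u ** 2 * (1 - r_uu)


# Hypothetical values for a partly speeded test.
s_w, r_ww = 4.0, 0.84
s_u = 2.0

# Split-half: the u-score correlates perfectly between halves (r_uu = 1),
# so the second term of equation 17 vanishes.
split_half_error = error_variance(s_w, r_ww, s_u, 1.0)

# Test-retest with an assumed r_uu of 0.75: the second term is restored.
retest_error = error_variance(s_w, r_ww, s_u, 0.75)

print(split_half_error < retest_error)
```

With these assumed figures the split-half error variance is about 2.56
against about 3.56 for the retest case, so the split-half figure
understates the error of measurement.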

If a test is primarily a power test, it is possible to use the split-half
reliability to estimate a range for the error of measurement. Setting $r_{uu}$
equal to zero in equation 17 will give an upper bound for the error of
measurement; setting it equal to unity (as is done in the split-half
reliability coefficient) will give a lower bound for the error of
measurement. For any split-half reliability in which the untried items are
divided equally between the halves it is necessarily true that

$$s_x^2 (1 - r_{xx}) = s_w^2 (1 - r_{ww}). \tag{18}$$

Since the error of measurement would be larger if the subjects had been
allowed to finish the test, but could not increase by more than the
value of $s_u^2 (1 - r_{uu})$ when $r_{uu} = 0$, we may use $s^2_{\text{meas.}}$ to represent the
error variance of the test and write

$$s_x^2 (1 - r_{xx}) + s_u^2 > s^2_{\text{meas.}} > s_x^2 (1 - r_{xx}), \tag{19}$$

where the terms have the same definition as in equation 17. However
equation 19 applies only in the case of a split-half reliability estimate
for a test that is primarily a power test so that $s_u^2$ will be a relatively
small possible addition to the error of measurement.

If a test is primarily a speed test, a test-retest or an alternate form
reliability must be used. The error of measurement calculated from this
reliability will have the two components indicated by equation 17. For
a pure speed test, the first of these components would be zero because
$s_w$ is zero for a speed test. Regardless of the magnitude of $s_w$, the error
of measurement calculated from a test-retest or an alternate form
reliability correctly represents the functioning error of measurement of
the test. If the directions for the test and the attitude of the subjects
were changed so that no errors (W-score) were made, the new error of
measurement would be different from the old one. It does not seem
feasible at present to try to estimate the possible magnitude of this
change.


4. Effect of unattempted items on the reliability

Equation 18 of Chapter 4 may be rewritten

$$s^2_{\text{meas.}} = s_x^2 (1 - r_{xx}). \tag{20}$$

Solving equation 20 explicitly for the reliability coefficient, we have

$$r_{xx} = 1 - \frac{s^2_{\text{meas.}}}{s_x^2}. \tag{21}$$


If we use equation 21 and substitute various values of the error of
measurement and the standard deviation as indicated in the two
preceding sections, we shall obtain some possible upper and lower bounds
for the reliability coefficient of power tests that are partially speeded
and speed tests that are in part power tests.

For a test that is primarily a power test, a possible estimate of a lower
bound for the reliability coefficient may be found by using a small
estimate of the standard deviation as given in equation 8 and a large
value for the error of measurement as indicated in the first expression
of equation 19. Substituting these two values in equation 21 gives

$$r'_{xx} = 1 - \frac{s_x^2 (1 - r_{xx}) + s_u^2}{(s_x - s_u)^2}. \tag{22}$$

Dividing through by $s_x^2$, setting $H$ for $s_u/s_x$, and writing the expression
with a common denominator gives

$$r'_{xx} = \frac{1 - 2H + H^2 - 1 + r_{xx} - H^2}{1 - 2H + H^2}. \tag{23}$$

Simplifying and ignoring the term $H^2$, we have

$$r'_{xx} = \frac{r_{xx} - 2H}{1 - 2H}. \tag{24}$$


Using a stepped-up split-half correlation for the reliability of a
partially speeded power test will certainly give a figure higher than the
actual reliability of the test so that the obtained reliability coefficient
that has been designated by $r_{xx}$ may be regarded as an upper bound for
the reliability coefficient. We may use equation 24 as a lower bound
and designate the correct reliability by $R$, obtaining

$$r_{xx} > R > \frac{r_{xx} - 2H}{1 - 2H}, \tag{25}$$

where $r_{xx}$ is the stepped-up split-half correlation,

$H$ is $s_u/s_x$, the ratio of the standard deviation of the “number
unattempted score” to the standard deviation of the
number not answered correctly ($u + w$), and

$R$ is the reliability of the test.


It should be noted that for many tests the right-hand term of equation
25 will give a lower bound that is distressingly low, and may well be far
lower than an alternate form reliability for the test. However, it seems
probable that, if this lower bound turns out to be satisfactorily high,
there can be little doubt that the reliability of the test will be
satisfactory. Beginning with equation 15, there are various other
assumptions that may be made regarding what might happen if a parallel form
instead of a split-half reliability had been used. An experimental
investigation of the typical behavior of the various terms in equation 15
is probably needed in order to determine which assumptions are most
appropriate.

Another possible lower bound for the reliability of a somewhat speeded
power test can be illustrated with equation 15. We may assume that
for a split-half reliability all the terms on the right-hand side of equation
15 are correct except for $r_{uu} s_u^2$. In a split-half correlation, this term is
clearly too large, since $r_{uu}$ is necessarily unity. If the term $s_u^2$ is
subtracted from the numerator of equation 15 this will have the effect of
assuming that $r_{uu}$ is zero, and may well give a good lower bound for the
reliability of the test. Let us refer to equation 14. The numerator may
be expressed as $r_{xx} s_x^2$. If we subtract $s_u^2$ from this, and divide by the
variance of $x$, we shall have a reliability figure under the assumption
that $r_{uu}$ is zero instead of unity. Thus we have

$$r''_{xx} = \frac{r_{xx} s_x^2 - s_u^2}{s_x^2}. \tag{26}$$

Writing $H$ for $s_u/s_x$, we have

$$r''_{xx} = r_{xx} - H^2, \tag{27}$$


where the terms have the same definition as in equation 25. For this
new lower bound we may write

$$r_{xx} > R > r_{xx} - H^2. \tag{28}$$

If a power test is partially speeded, and a split-half
reliability ($r_{xx}$) has been calculated, equation 25 or 28 may be
used to give some idea of the extent to which $r_{xx}$ is an
overestimate of the test reliability.

Some evidence has been presented (see Gulliksen, 1950) indicating
that equation 28 may be satisfactory provided $H^2$ is less than 0.2.
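
The two lower bounds in equations 25 and 28 are simple enough to compute
side by side. The Python sketch below is illustrative only: the function
names are hypothetical and the values of $r_{xx}$ and $H$ are assumed, not
taken from any test described here.

```python
def lower_bound_eq25(r_xx, h):
    """Right-hand term of equation 25; requires H below 0.5."""
    if not 0 <= h < 0.5:
        raise ValueError("H must lie in [0, 0.5) for this bound")
    return (r_xx - 2 * h) / (1 - 2 * h)


def lower_bound_eq28(r_xx, h):
    """Right-hand term of equation 28: r_uu taken as zero instead of unity."""
    return r_xx - h ** 2


# Assumed stepped-up split-half reliability and ratio H = s_u / s_x.
r_xx, h = 0.90, 0.10

bound_25 = lower_bound_eq25(r_xx, h)
bound_28 = lower_bound_eq28(r_xx, h)
print(round(bound_25, 3), round(bound_28, 3))
```

For these assumed values the bound from equation 25 is the lower of the
two, in keeping with the remark above that it can be distressingly low.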

If a test is primarily a speeded test, an alternate form or a test-retest
reliability must be used. Such a reliability correctly represents the
functioning reliability of the test. If only the number of items
unattempted is used as the score, we have a relatively pure measure of
speed; if the number correct is used, both speed and accuracy enter
into the score. By using both these scores, we can determine the relative
reliability of speed alone and of speed together with accuracy. Since
this problem is purely experimental, no further theoretical discussion
will be given here.

5. Estimation of the variance of the number-unattempted score from item analysis data

The preceding discussion has been in terms of the number of
unattempted items because it is possible to obtain the variance of this score
from item analysis data which gives number answered correctly, number
answered incorrectly, and number of persons not reaching the item.
Thus, if item analysis data are available, the variance of the
“number-unattempted score,” hence the ratio $H$, can be calculated without
rescoring of the papers.

Let us use $K$ to designate the number of items in the test and $y_g$ to
designate the number of persons who did not reach item $g$. It is clear
then that $y_{g+1} \geq y_g$, since all persons who did not reach item $g$ did not
reach any subsequent item. We shall also assume that $y_g = 0$ for
the items near the beginning of the test. It is clear that

$$\sum_{i=1}^{N} U_i = \sum_{g=1}^{K} y_g. \tag{29}$$

That is, the sum of all the unattempted scores may be obtained by
summing over persons or over items. Therefore,

$$M_U = \frac{\sum_{i=1}^{N} U_i}{N} = \frac{\sum_{g=1}^{K} y_g}{N}. \tag{30}$$

That is, the average unattempted score is equal to the sum of the
number unattempted from item 1 to item $K$, divided by the number of
persons taking the test.

In order to obtain the standard deviation of the number-unattempted
score, we shall use the usual formula for standard deviation written as
follows:

$$s_u^2 = \frac{\sum_{i=1}^{N} U_i^2}{N} - M_U^2. \tag{31}$$

Since $M_U$ is given by formula 30, we know all the terms in this equation
except $\sum U^2$. Let us use $n_u$ to indicate the number of persons making
an unattempted score of $u$; then

$$\begin{aligned}
n_1 &= y_K - y_{K-1} \\
n_2 &= y_{K-1} - y_{K-2} \\
n_3 &= y_{K-2} - y_{K-3} \\
&\;\;\vdots \\
n_u &= y_{K-u+1} - y_{K-u} \\
&\;\;\vdots \\
n_{K-2} &= y_3 - y_2 \\
n_{K-1} &= y_2 - y_1 \\
n_K &= y_1.
\end{aligned} \tag{32}$$

Many of the terms in equation 32 will be zero, since all the subjects
will presumably attempt many of the earlier items in the test.

In order to obtain $\sum U^2$ it is necessary to multiply the first frequency
by $1^2$, the second by $2^2$, and so on. The sum of the resulting products
is $\sum U^2$. Using equations 32 to write this sum of products, we obtain

$$\begin{aligned}
\sum U^2 = {} & 1^2 (y_K - y_{K-1}) + 2^2 (y_{K-1} - y_{K-2}) + 3^2 (y_{K-2} - y_{K-3}) + \cdots \\
& + u^2 (y_{K-u+1} - y_{K-u}) + \cdots + (K-2)^2 (y_3 - y_2) \\
& + (K-1)^2 (y_2 - y_1) + K^2 y_1.
\end{aligned} \tag{33}$$

As pointed out in connection with equations 32, when one of the
$y$-terms is zero, all subsequent terms are zero and may be omitted from
the summation.

Removing the parentheses in equation 33 and performing the
subtractions gives

$$\begin{aligned}
\sum U^2 = {} & y_K (1^2) + y_{K-1} (2^2 - 1^2) + y_{K-2} (3^2 - 2^2) + \cdots \\
& + y_{K-u+1} [u^2 - (u-1)^2] + y_{K-u} [(u+1)^2 - u^2] + \cdots \\
& + y_3 [(K-2)^2 - (K-3)^2] + y_2 [(K-1)^2 - (K-2)^2] + y_1 [K^2 - (K-1)^2].
\end{aligned} \tag{34}$$

As before, this series is continued until all subsequent $y$’s are zero.
Since the difference of successive squares constitutes a series of
consecutive odd numbers, equation 34 can be written as

$$\begin{aligned}
\sum U^2 = {} & 1 y_K + 3 y_{K-1} + 5 y_{K-2} + \cdots + (2u - 1) y_{K-u+1} + (2u + 1) y_{K-u} \\
& + \cdots + (2K - 5) y_3 + (2K - 3) y_2 + (2K - 1) y_1.
\end{aligned} \tag{35}$$

The sum of this series may be written

$$\sum_{i=1}^{N} U_i^2 = 2 \sum_{u=0}^{K-1} u\, y_{K-u} + \sum_{u=0}^{K-1} y_{K-u}. \tag{36}$$

The summation begins with $u = 0$ because the first term is $y_K$, that is,
$y_{K-u}$, where $u = 0$. For the sake of completeness, the summation is
indicated as extending to $(K - 1)$, but in any computational problem
many terms will be zero and can be omitted. From equation 30 we see
that the last term of equation 36 is equal to $N M_U$. Substituting
equation 36 in equation 31, we have the solution,

$$s_u^2 = \frac{2}{N} \sum_{u=0}^{K-1} u\, y_{K-u} + M_U - M_U^2, \tag{37}$$

where $s_u$ is the standard deviation of the number-unattempted score,

$y_{K-u}$ is the number of persons not reaching the $(K - u)$th item,

$N$ is the number of persons, and

$M_U$ is given by equation 30.
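
Equations 30 and 37 can be checked against a direct calculation from the
individual scores. The Python sketch below uses a small made-up group;
the function name `unattempted_stats` and the data are hypothetical, and
the list `y` is assumed to hold $y_1, \ldots, y_K$ in item order.

```python
def unattempted_stats(y, n_persons):
    """Mean (eq. 30) and variance (eq. 37) of the number-unattempted score.

    y[g - 1] is the number of persons who did not reach item g.
    """
    k = len(y)
    mean_u = sum(y) / n_persons                      # equation 30
    # In equation 37, u runs from 0 to K-1 and y_{K-u} is y[k - u - 1].
    weighted = sum(u * y[k - u - 1] for u in range(k))
    var_u = 2 * weighted / n_persons + mean_u - mean_u ** 2
    return mean_u, var_u


# Hypothetical five-item test taken by ten persons, with these
# number-unattempted scores (six persons finish the test).
scores = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
K, N = 5, len(scores)

# Item analysis counts: a person with score U fails to reach the last U items.
y = [sum(1 for s in scores if s >= K - g + 1) for g in range(1, K + 1)]

mean_u, var_u = unattempted_stats(y, N)

# Direct calculation from the individual scores (equation 31).
direct_mean = sum(scores) / N
direct_var = sum(s ** 2 for s in scores) / N - direct_mean ** 2
print(abs(mean_u - direct_mean) < 1e-12, abs(var_u - direct_var) < 1e-12)
```

Only the counts in `y` are needed by `unattempted_stats`; the individual
scores appear here solely to confirm that the two routes agree.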

By using equations 30 and 37, $s_u$, and hence $H$ (for use in
equations 24, 25, and 28), may be calculated directly from
item analysis data showing the number of persons not
reaching each item. These equations will enable us to avoid the
labor of rescoring the answer sheets in order to obtain $s_u$.

6. Summary

In discussing the speed-power problem, the following symbols were 
used: 

W (wrongs), the number of items for which the subject gives an 
incorrect answer, 

U (unattempted), the number of items not reached at the end of the 
test, and 

X the total error score (X = U + W). 

It is assumed that there are no skipped items. 

In a pure speed test $M_w$ and $s_w$ are zero. If $s_w/s_x$ is small (0.1 or
less), the test may be regarded as primarily a speed test. In this case

$$
1 + \frac{s_w}{s_x} \ge \frac{s_u}{s_x} \ge 1 - \frac{s_w}{s_x}, \tag{4}
$$

a lower bound for the standard deviation is indicated by

$$
s_u' = s_x - s_w, \tag{5}
$$

and an upper bound for the standard deviation by

$$
s_u'' = s_x + s_w, \tag{6}
$$

where $s_w$ is the standard deviation of the W-score,
$s_u$ is the standard deviation of the U-score, and
$s_x$ is the standard deviation of the X-score.

If a test is primarily a speed test, the reliability and the error of 
measurement must be estimated by means of a test-retest or an alternate 
form reliability coefficient. The reliability and the error of measurement 
so computed will correctly represent the functioning reliability of the 
test under the test directions and administrative conditions that were 
used. If the test conditions are changed in an effort to eliminate the
W-score, it does not seem possible to make reasonable estimates regarding
what will happen to the error of measurement and the reliability.

In a pure power test, $M_u$ and $s_u$ are zero. If $s_u/s_x$ is small (0.1 or
less), the test may be regarded as primarily a power test. In this case

$$
1 + \frac{s_u}{s_x} \ge \frac{s_w}{s_x} \ge 1 - \frac{s_u}{s_x}, \tag{7}
$$

a lower bound for the standard deviation is indicated by

$$
s_w' = s_x - s_u, \tag{8}
$$

and an upper bound for the standard deviation by

$$
s_w'' = s_x + s_u. \tag{9}
$$

For any split-half reliability that divides the U-score equally between
the two halves, it is necessarily true that

$$
s_x^2(1 - r_{xx}) = s_w^2(1 - r_{ww}). \tag{18}
$$

If $s^2_{\text{meas.}}$ is used to designate the error variance as obtained
from an alternate form reliability or from allowing the subjects to finish
the test,

$$
s_x^2(1 - r_{xx}) + s_u^2 \ge s^2_{\text{meas.}} \ge s_x^2(1 - r_{xx}). \tag{19}
$$

Two different methods were suggested for estimating a lower bound 
for the reliability coefficient that would be obtained if the students were 
allowed to finish the test, or if an alternate form reliability had been 
calculated. It was found that 
, x r„ - 2 H 

(25)

or that

$$
r_{xx} \ge R' \ge r_{xx} - H^2. \tag{28}
$$

It should be noted that both these estimates are highly tentative and 
that more experimental work on the relation between speed and power 
needs to be done before we can know which assumptions are the best 
ones to make. It seems now that $R'$ is better than $R$.

In the four preceding equations:

$r_{xx}$ is the split-half reliability for the X-score,
$r_{ww}$ is the split-half reliability for the W-score, and
$H$ is $s_u/s_x$.


In order to guard against the possibility of spuriously high split-half
reliabilities being reported for partly speeded tests, it would seem
desirable to present routinely the coefficient H or the lower bound of
formula 28 whenever a split-half reliability is reported.
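
A minimal sketch of how these quantities might be reported together is
given below. It evaluates $H$, the lower bound of formula 28, and the two
bounds on the error variance in equation 19. The function name and the
numerical values are hypothetical and are not taken from the problems at
the end of this chapter.

```python
def speed_adjusted_bounds(r_xx, s_x, s_u):
    """Bounds from equations 19 and 28 for a partly speeded test.

    r_xx: split-half reliability of the X-score
    s_x:  standard deviation of the X-score
    s_u:  standard deviation of the number-unattempted score
    """
    h = s_u / s_x                              # H = s_u / s_x
    lower_reliability = r_xx - h ** 2          # equation 28: R' >= r_xx - H^2
    err_var_lower = s_x ** 2 * (1.0 - r_xx)    # right-hand bound of equation 19
    err_var_upper = err_var_lower + s_u ** 2   # left-hand bound of equation 19
    return h, lower_reliability, err_var_lower, err_var_upper


# Hypothetical values for illustration only.
h, r_low, e_low, e_up = speed_adjusted_bounds(r_xx=0.90, s_x=10.0, s_u=2.0)
```

With these assumed values $H$ is 0.2, so the reported split-half
reliability of .90 could be as low as about .86 if the subjects were allowed
to finish, and the error variance lies between about 10 and 14.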

It was also shown that the variance of the U-score could be calculated
from item analysis data showing the number of persons who did not
reach each item. If $y_g$ designates the number of persons who did
not reach item $g$,

$$
M_u = \frac{1}{N}\sum_{g=1}^{K} y_g, \tag{30}
$$

and

$$
s_u^2 = \frac{2}{N}\sum_{u=0}^{K-1} u\,y_{K-u} + M_u - M_u^2, \tag{37}
$$





Chap. 17] Speed versus Power Tests 

where $M_u$ is the mean U-score,
$s_u^2$ is the variance of the U-score,
$N$ is the number of persons taking the test, and
$y_{K-u}$ is the number of persons not reaching the $(K - u)$th item.


Problems 

Data for Problems 1-3

Item     p     N_a
  1     96     500
  2     94     500
  3     90     500
  4     87     500
  5     92     500
  6     82     500
  7     84     500
  8     87     500
  9     80     500
 10     60     500
 11     68     500
 12     63     498
 13     45     497
 14     55     497
 15     50     495
 16     40     493
 17     62     490
 18     50     487
 19     65     485
 20     30     480
 21     20     470
 22     23     465
 23     25     460
 24     40     450
 25     30     441
 26     22     432
 27     26     417
 28     36     406
 29     48     393
 30     21     372

M = 15.9
s = 8.3
r = .97 (corrected odd-even correlation)

N_a = number of persons who attempted, that is, indicated some answer (right
or wrong) for each item.

p = percentage of those attempting the item who answered it correctly.
1. Calculate the standard deviation of the number-unattempted score from the
item analysis data given. 

2. Using the data given, plot the frequency distribution of the number-unattempted 
score, calculate the mean and standard deviation of this distribution to verify the 
calculation in problem 1. 

3. How seriously might the reliability of this test be affected by the speeded 
nature of the test score? 

Data for Problems 4-6

                       Gross Score
        Number     --------------------
Test    of Items   Mean      Standard       R
                             Deviation
 A        180      73.6        27.4        .97      2.1
 B        150      93.7        16.3        .93      7.6
 C         90      55.1        14.5        .95      6.3
 D         70      30.2         8.4        .85      0.5
 E        100      53.6        11.2        .82      5.4

R = 2r/(1 + r), where r = the odd-even correlation.


4. Give a lower bound for the reliability coefficient of each of the tests A to E. 

5. Give the error of measurement for each test, and also an upper bound for this
error. 

6. (a) For which tests is an odd-even reliability justified? 

(b) Which tests require an alternate form or a test-retest reliability? 
Methods of Scoring Tests


1. Introduction 

In this chapter we shall consider two basically different types of 
scoring problems. One type includes the problems in scoring tests where 
each item has one or possibly more answers that are correct (hence are 
scored one point) and other answers that are incorrect (hence receive 
zero credit). The other type of test question is the one for which there 
is no generally “correct” answer. Items used in attitude, interest, or 
personality schedules are of this type, and they present special scoring 
problems. 

Only the simpler methods of scoring tests, based on time or on item
count, will be considered here. Scoring methods that attempt to determine
“level reached,” such as used in the Binet test, demand a different
type of theoretical approach, and will not be considered here. The more
precise absolute scaling methods presented by Thurstone (1925 and
1927b) also require a different theoretical approach, and are beyond the
scope of this book.

For purposes of this discussion, we shall consider that the items of a 
test can be divided into four categories, designated as follows: 

R (rights), the number of items marked correctly, 

W (wrongs), the number of items marked incorrectly, 

S (skips), the number of items that have not been marked, but are 
followed by items that have been answered (R or W). It looks 
as if the subject attempted to work the item, and then decided 
to skip it and move on to a later item. 

U (unattempted), the number of consecutive items at the end of the 
test that are not marked. It looks as if the subject did not have 
a chance to attempt these items before time was called. 

There is a possibility that the number of items skipped (S) or the
number at the end of the test that are unattempted (U) would be useful
scores. Such scores, coupled with careful test directions, may indicate
“cautiousness” or some other similar personality characteristic of the
subjects. Some subjects may show a consistent tendency to mark items
and to get them wrong; others may hesitate and skip items, hence have
a much larger S or U score. No one seems to have investigated such
possibilities.

2. Number of correct answers usually a good score 

The way tests are usually handled at present is to frame the directions
to emphasize that the subjects should answer the items consecutively.
This means that the number skipped (S) will be zero or negligibly small.
In a power test an effort is made to allow sufficient time for nearly all
the questions to be answered by nearly everyone. This means that the
number of items unattempted (U) will be small, and the score can be
regarded as depending primarily on the number of items marked
incorrectly (W). In a speed test an effort is made to have no items
answered incorrectly (W = 0). The score in this case can be regarded
as depending primarily on the number of items unattempted (U).

If a test is primarily a power test, that is, if S and U are each negligible,
the score may be the number marked correctly (R) or its complement
(W), the number marked incorrectly. If a test is primarily a speed test,
as is the case if W and S are each negligible, the score should be the
number marked correctly (R) or its complement (U), the number of
items unattempted. We shall now consider the cases in which S or U
(the number of unmarked items) is not negligible for a test that is
designed as a power test, and in which S or W, the number of items
marked incorrectly or skipped, is not negligible for a test that is designed
as a speed test.

3. The problem of guessing in a power test 

Under ordinary examining conditions, even if S and U, the number
of unmarked items, are fairly large, the number of items marked
correctly (R) will turn out to be a suitable score for the examination. This
will be the case if each student reads each item and honestly tries to
solve the problem before marking an answer. In general, the student
who knows the material will solve the problems correctly and more
quickly; hence he will have more correctly marked answers than the
student who does not know the material. However, the test constructor
and the test scorer must bear in mind that it is possible for a student
who does not know the answer to an item to mark it correctly by chance
in an objective examination. If practically all items are marked by
each of the students, this effect is not a serious one and can be ignored.
However, sometimes a student may observe that he has only two minutes
left and may feel that it is good policy to mark quickly the last twenty
or thirty items that he does not have time to read in order to get the
benefit of a chance score. If the score is taken as number marked
correctly, this student is likely to add more to his score in the last two
minutes than another equally good student who spent the last two
minutes attempting to solve one item.

It should be possible to detect such cases by plotting the number of
the last item attempted as the abscissa, against R, the number marked
correctly, as the ordinate. On such a plot, the line y = x would indicate
the locus of scores that were perfect as far as the items were marked, and
the line y = (1/A)x, where A is the number of alternatives for each
item, would indicate the locus of the average chance score. For example,
if the test is composed of five-choice items, the average score from pure
guessing would be one-fifth of the items correct, and the line y = (1/5)x
would be the locus of such scores. If some points with a relatively high
R, number of correct answers, are near this line, they show that a
relatively good R score is made by some persons who are apparently guessing
the answers to a large number of items.

A more accurate plot to indicate the presence of good scores made by 
guessing would be to plot the number correct (R) as the ordinate against
the number attempted (R + W) as the abscissa. In the plot previously 
mentioned, the number of the last item attempted is equal to 
(R + W + S). In the new plot the points would be moved to the left, 
and therefore away from the chance line. That is, if the first plot of R 
against number of last item attempted shows no points near the chance 
line, there are no scores that are chance scores. If the first plot shows 
points near the chance line, it may be desirable to make the second plot, 
which is more time consuming, in order to see if we still have a clear 
indication that good R scores can be made near the level of an average 
chance score. 
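
The second plot can also be screened numerically. The sketch below is only
an illustration: the function names, the tolerance of 0.05 around the chance
ratio, and the minimum R taken to count as a "good" score are all assumptions
introduced here, and any real application would need to choose them by
inspecting the plot.

```python
def near_chance_line(rights, attempted, n_alternatives, tolerance=0.05):
    """True when R / (R + W) lies within `tolerance` of the chance ratio 1/A."""
    if attempted == 0:
        return False
    return abs(rights / attempted - 1.0 / n_alternatives) <= tolerance


def flag_suspect_scores(records, n_alternatives, min_rights, tolerance=0.05):
    """Indices of papers with a fairly high R that sit near the chance line.

    records is a list of (R, R + W) pairs, one per paper.
    """
    return [
        index
        for index, (rights, attempted) in enumerate(records)
        if rights >= min_rights
        and near_chance_line(rights, attempted, n_alternatives, tolerance)
    ]


# Hypothetical papers on a five-choice test: (number right, number attempted).
papers = [(40, 45), (12, 60), (30, 100), (5, 25)]
suspects = flag_suspect_scores(papers, n_alternatives=5, min_rights=10)
```

Here only the second paper is flagged: it has a moderately high R, yet only
one-fifth of the items attempted are correct.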

If we have a test in which some persons are making high R scores
on the basis of a chance ratio between number right and number
attempted, the situation is unsatisfactory, and steps must be taken to
alter it. If the test is a trial run, it may be possible to shorten the test
by eliminating some of the items, so that more people will finish the
test, or it may be possible to lengthen the time allowed for the test so
that more persons can finish. If either of these changes can be made,
we may still retain the simple number right score. This score has the
advantage of being quick to obtain, and of allowing relatively little
opportunity for clerical errors. However, if the test scores must be
used as is, or if it is not possible for other reasons to shorten the test
or lengthen the time, it is possible to consider more complicated scoring
formulas that attempt to take account of some of the possible effects of
guessing. It must be emphasized again that there is no reason for
considering any of these formulas if, for most of the people, R + W is
essentially equal to the total number of items in the test. Such formulas
are to be used if, and only if, the number of unmarked items (S + U)
is fairly large for some persons, and fairly small for others.

Let B = the number of items left blank. This includes those left
blank because they were skipped and those not attempted at the end
of the test. That is,

$$
B = S + U.
$$

Using K to designate the total number of items in the test, we have

$$
K = R + W + B.
$$

One method of dealing with the problem of variation in amount of
“guessing” from one person to another is to assume that, if there are A
alternative choices for each item, then if each person had answered every
item he would have answered 1/A-th of them correctly by chance. Let
$X_B$ designate the score (number right) that would probably have been
made if every item had been attempted; then

$$
X_B = R + \frac{1}{A}B. \tag{1}
$$

It should be noted that, if any of the items in an examination are so
difficult and have such plausible distractors that less than 1/A-th of
the persons attempting the item get it correct, equation 1 cannot be
used. Using it would have the peculiar result that persons not attempting
an item would get a higher score than those who thought about the
item and answered it. Items of such a high level of difficulty should
not be used unless there is some special reason that demands their use.
For example, a test of “common fallacies” or “popular superstitions”
would necessarily contain items that often might be answered correctly
by less than the expected chance percentage of those attempting the
item. In such a case, however, it is necessary to allow time for every
person to answer each of the items so that no correction for effects of guessing
will be necessary.

Instead of estimating how many items would have been marked
correctly if all items had been marked, it is also plausible to approach
this problem of correction for guessing by attempting to estimate the
number of items for which the person knew the correct answer. In this
approach it is assumed that the items left blank are not known so that
nothing need be added for them. It is also assumed that of the items
that the person guessed, 1/A-th were (by chance) answered correctly
and are included in the group of items (R) answered correctly. The
remaining fraction of the items answered by guessing, (A - 1)/A,
represents the items answered incorrectly or the group of items
previously designated W. It follows then that W/(A - 1) is equal to the
number of items in the R group that were lucky guesses. This number
should be subtracted from the number answered correctly to give an
estimate of the number of items for which the answer is known. Let
$X_W$ designate the number of items for which the answer is known; then

$$
X_W = R - \frac{W}{A - 1}. \tag{2}
$$

Again it should be noted that, like equation 1, equation 2 cannot be 
used when items are so difficult that less than a chance proportion of 
those attempting the item get it correct. Hamilton (1950) has utilized 
the regression line to give a more accurate treatment of the problem 
of chance success. 
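
The two corrections can be compared on a single paper. The following is a
minimal sketch of equations 1 and 2; the function names and the example
paper (50 five-choice items, 30 right, 10 wrong, 10 blank) are hypothetical.

```python
def score_blank_credit(rights, blanks, n_alternatives):
    """Equation 1: X_B = R + B/A, chance credit for every blank item."""
    return rights + blanks / n_alternatives


def score_rights_minus_wrongs(rights, wrongs, n_alternatives):
    """Equation 2: X_W = R - W/(A - 1), removing the estimated lucky guesses."""
    return rights - wrongs / (n_alternatives - 1)


# Hypothetical paper: K = 50 five-choice items, R = 30, W = 10, B = 10.
x_b = score_blank_credit(rights=30, blanks=10, n_alternatives=5)
x_w = score_rights_minus_wrongs(rights=30, wrongs=10, n_alternatives=5)
```

For this paper equation 1 gives 32 and equation 2 gives 27.5, so the two
scores are on different numerical scales even though they are computed from
the same answers.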

Equation 1 will always give higher numerical scores than equation 2,
except for persons making a perfect score. From the viewpoint of
checking against norms, for example, the two equations are not
interchangeable. However, from the viewpoint of ranking the students or
of making correlational studies, the two scores will give exactly the same
results, since they are perfectly correlated. To show this, we shall write
the functional relationship between $X_B$ and $X_W$.
Since K = R + W + B, we may write

$$
B = K - W - R. \tag{3}
$$

Substitute this value in equation 1 and rearrange terms, obtaining

$$
X_B = \frac{A - 1}{A}R - \frac{1}{A}W + \frac{1}{A}K. \tag{4}
$$

If we multiply both sides by A/(A - 1) and subtract the constant
K/(A - 1), we have

$$
\frac{A}{A - 1}X_B - \frac{K}{A - 1} = R - \frac{W}{A - 1}. \tag{5}
$$

Since the right side of equation 5 is identical with equation 2, we have
expressed $X_W$ as a linear function of $X_B$.
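
A numerical check of equation 5 may make the linear relation concrete. The
values K = 50, A = 5, R = 30, W = 10, and B = 10 are hypothetical and are
used only for illustration:

```latex
\begin{aligned}
X_B &= R + \tfrac{1}{A}B = 30 + \tfrac{10}{5} = 32, \\
X_W &= R - \tfrac{W}{A - 1} = 30 - \tfrac{10}{4} = 27.5, \\
\tfrac{A}{A - 1}X_B - \tfrac{K}{A - 1} &= \tfrac{5}{4}(32) - \tfrac{50}{4}
  = 40 - 12.5 = 27.5 = X_W .
\end{aligned}
```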

There is another method of dealing with the problem of correcting
scores on a primarily power test for the effects of possible guessing. The
method to be proposed guards against the practice of quickly answering
all unfinished items just before time is called. The suggestion is to use
the score

$$
X_U = R + \frac{U}{A}. \tag{6}
$$

To the number right (R), add 1/A-th of the number of items at the end
of the test that were unattempted by the subject. This differs from
equation 1 in that no partial credit is given for skipping an item. If a
subject studies an item, it is desirable to encourage him to give his most 
considered response to that item. Under equation 6 the subject would 
have everything to gain and nothing to lose by marking each item that 
he had time to study. However, there would be no point to rushing 
through during the last minute of the examination and marking all 
remaining items since he would get credit for a chance proportion of 
them anyway. Even the last minute of the examination would, under 
such a system, best be spent in attempting to give a correct answer to 
one more item. 
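
The argument can be illustrated with a short sketch of equation 6. The
function names and the example paper are hypothetical; the second function
simply assumes that items marked without being read are answered correctly
at the chance rate 1/A.

```python
def score_unattempted_credit(rights, unattempted, n_alternatives):
    """Equation 6: X_U = R + U/A (credit only for items not reached)."""
    return rights + unattempted / n_alternatives


def expected_score_if_marked_blindly(rights, unattempted, n_alternatives):
    """Expected X_U if the last U items are instead marked at random.

    On the average 1/A-th of the blindly marked items are right, and no
    unattempted items remain to receive credit.
    """
    expected_lucky_rights = unattempted / n_alternatives
    return score_unattempted_credit(rights + expected_lucky_rights, 0, n_alternatives)


# Hypothetical paper: 30 right, 20 items not reached, five-choice items.
left_blank = score_unattempted_credit(30, 20, 5)
marked_blindly = expected_score_if_marked_blindly(30, 20, 5)
```

Both values are 34, which is the sense in which the subject gains nothing,
on the average, by rushing through the remaining items.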

As far as the writer knows, equation 6 has not been suggested previously
or studied, especially with respect to its effect on the attitude of
students taking an examination. Perhaps it would avoid some of the 
undesirable examination attitudes that are sometimes engendered by 
objective examinations. It must again be stressed that equations 1, 2, 
and 6 are suggested only when it is not feasible to allow the students to 
finish the examination. The best policy is to insure that practically 
all items are attempted by practically all the students, and then simply 
score number right (R). 

If we depend on IBM machine scoring of tests, the possibility that a 
student will mark several answers to one item must be considered. 
Multiple marks on a single item may occur either because the student 
has misunderstood the directions or because of a belief that the “machine 
will just sense the correct marks.” By ordinary scanning procedures 
it is difficult to be sure of detecting all multiple marking. An easy 
method of dealing with this possibility, and also with the possibility 
that some students will mark items without reading them in order to 
finish the test, is to score the test rights minus an appropriate fraction 
of the wrongs. For hand scoring, this equation is considerably more 
labor than the number right score. For machine scoring, the papers 
are scored in the same time regardless of the scoring formula. For the 
rights minus wrongs scoring, it takes a little longer to make the
preliminary adjustments on the machine.

When marking papers by hand, scoring solely on the basis of number 
correct is usually perfectly satisfactory and considerably more rapid and 
accurate than using any of the foregoing scoring formulas. However,
if the test has a short time limit so that many persons do not finish it, 
the scorers must note, and call to the attention of the supervisor, any 
cases in which an unusually large number of items have been answered 
and an unusually large number of errors occur toward the end of the 
test paper. If a moderately high score is made in this way it may be 
desirable to rescore the papers using one of the scoring formulas given 
in this section. 


4. The problem of careless errors in a speed test 


In a speed test none of the equations given in the preceding section is
appropriate. Giving credit for 1/A-th of the unfinished items (equations
1 or 6) is inappropriate because the score in a speed test should represent
the number of items the student is able to do in the allotted time.
Deducting for items answered correctly by chance (equation 2) is
inappropriate because in a properly constructed speed test the items should
not be difficult. Each student should be able to answer each item if he
studies the item. Thus there is no problem of estimating how many
items the student knows, as distinct from how many lucky guesses he
made. The problem is simply, how many items can be solved in the
allotted time? If time were increased sufficiently each student would
receive a perfect score. If a speed test is properly constructed, and if
the students respond properly, the number of skips (S) and the number
answered incorrectly (W) will be zero. The test can be scored either
in terms of number right (R) or number unattempted at the end of the
test (U).

However, if a test is designed as a relatively pure speed test, and we 
observe papers in which all the items are marked and the number of 
errors near the end of the paper is much greater than the number near 
the beginning, it may be well to suspect that those students are answering 
the items without studying them in order to capitalize on a possible 
chance score. It is then necessary to rescore the papers using some sort 
of penalty for items marked incorrectly. 

In order to motivate the students to answer each item correctly (not
to mark items carelessly), and not to skip items, it is desirable to stress
both these points in the instructions. It may also be well to have a
small penalty for skips and a larger penalty for errors. This formula
would be

$$
X_S = R - \frac{W}{C} - \frac{S}{D}, \tag{7}
$$

where C and D are arbitrary constants, C < D. In order to motivate
the students properly, it perhaps is appropriate to make D slightly
larger than the number of alternatives (A), and C slightly smaller than
A - 1. For example, in a five-alternative multiple-choice test the
formula might be chosen as



Perhaps such a device would encourage the student to mark the item, 
but to be careful to mark it correctly. 
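
A minimal sketch of equation 7 follows. The constants C = 3 and D = 6 are
assumed values, chosen here only because they satisfy C < A - 1 and D > A
for A = 5; they are not taken from the text, and the function name and the
two example papers are likewise hypothetical.

```python
def speed_score_with_penalties(rights, wrongs, skips, c, d):
    """Equation 7: X_S = R - W/C - S/D, with arbitrary constants C < D."""
    if not c < d:
        raise ValueError("equation 7 assumes C < D")
    return rights - wrongs / c - skips / d


# Two hypothetical papers with the same number right on a five-choice test.
careless = speed_score_with_penalties(rights=40, wrongs=6, skips=0, c=3, d=6)
cautious = speed_score_with_penalties(rights=40, wrongs=0, skips=6, c=3, d=6)
```

With these assumed constants the paper with six errors scores 38 and the
paper with six skips scores 39, so an error costs more than a skip, while
either costs less than the point gained by marking the item correctly.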

If the penalty for errors or omissions in a speed test is to be used, it 
is probably desirable to study the effect of different penalties on the 
performance of the students. For example, the penalty for errors in a 
typing test is arbitrarily set at ten words per error. What would be 
the effect on student performance if he knew that the penalty would be 
twenty words, or if he knew that it would be five words per error? 

It should also be noted that, if we have a criterion to predict, it is 
unnecessary to bother with these arbitrary scoring formulas. Multiple 
correlation methods will give the best weights to use. 

It is important to adjust the instructions and the motivation of the 
students so that all items are answered, and are answered honestly, after 
some study and thought by the student. If such an attitude is secured 
from all students, then either number right (R), number wrong (W), or
number not attempted (U) could readily be used as the score without
troubling about any scoring formula. It would be desirable to choose 
the one of the three that had the largest variance for that particular 
test as the final score. Every effort should be made to design the 
examination, the instructions, and the motivation of the students to
discourage the use of various irrelevant tricks that are frequently applied 
in connection with objective examinations. For example, students often 
inquire if there is a “penalty for guessing.” If the answer is “no,” they 
will mark a great many items without bothering to read them if time 
seems short, with the expectation that some will be correct by chance; 
if the answer is “yes,” they will skip items rather than imperil their 
score by guessing. Either attitude is to be avoided since both introduce 
considerations that are probably irrelevant to the student’s knowledge 
and understanding of the field, and these are the things that should be 
measured by the examination. 

5. Time scores for a speed test 

Sometimes the time taken to perform a standard task is the score 
assigned to a test. The larger the score, the poorer the performance. 
In this respect the time score has one property of an error score. It 
should be noted that in general a time score is not especially suitable
for group testing. When testing individuals or small groups of three to 
five, the examiner can easily hold a stop watch and mark the time 
when each person finishes. If we wish to secure more uniform timing 
and are satisfied with relatively coarse groupings, such as half-minute 
or one-minute groupings, it is possible to have a single time-keeper for a 
large group of proctors, each of whom is responsible for a small group 
of students. The time-keeper displays a card with a number on it or 
writes a number on the blackboard at stated intervals. This digit 
indicates the number of minutes or half minutes since the test started, 
or the number until the conclusion of the test (if we wish to have a score 
such that higher numbers indicate better performance). The proctor 
then writes this number on the student’s answer blank when the student 
has finished the task. If we are willing to rely on the students, it is 
possible to have the student write the number on his own answer blank. 
It is probable that this method should not be used if only one time limit 
is being taken, since it would be relatively simple for the student to 
write a different number from the one that was actually being shown. 
However, if the test is long and keeps the students working for the entire 
time, it is probably all right to have the student indicate the time of 
finishing each of a number of subsections, if such a time score is desired. 

It should also be noted that time scores could readily have many of 
the properties of number-correct scores. For example, doubling the 
test would give two time scores for each person, and the total score on 
the test would be the sum of the two time scores. If the means, variances,
and covariances satisfied the criterion for parallel tests (see
Chapter 14), the theorems regarding effect of increased length would 
hold. In applying the theorems previously established to time scores, 
it is essential to see that differential fatigue is not a serious factor. For 
example, the time taken to run a hundred-yard dash is a perfectly good 
score for the hundred-yard dash “test.” It does not follow that the test 
becomes more reliable as it is lengthened. We cannot use four one- 
hundred-yard dashes in succession and then perhaps decide to use a 
five-hundred-yard dash as our final test in order to secure adequate 
reliability. The same consideration applies in lesser degree to any test. 
To a considerable extent the nature of a fifteen-minute test cannot be 
the same as a six-hour test. There are added factors of fatigue, etc., 
entering in; and we usually find six-hour tests divided into two three-hour 
sessions. In other words, when equations on test length are used for 
timed tests, the same precautions previously mentioned apply. Each 
of the new “unit” tests must be “parallel” to each other. This means 
that the test average, the standard deviation, and all intercorrelations 
must be the same. If this is not found to be true as we lengthen either a
power or a speed test, the equations relating test length to other
parameters no longer hold.
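
As a reminder of what is at stake, the sketch below evaluates the familiar
Spearman-Brown relation, which is assumed here to be the kind of
length theorem referred to above; for a doubled test it reduces to the
form R = 2r/(1 + r). The function name and the reliability of .70 are
hypothetical. The result is meaningful only if the added unit tests are
parallel in the sense just described, which is exactly the assumption that
fatigue may violate for time scores.

```python
def spearman_brown(unit_reliability, length_factor):
    """Reliability of a test lengthened `length_factor` times.

    Assumes every added unit test is parallel to the original one.
    """
    return (length_factor * unit_reliability) / (
        1.0 + (length_factor - 1.0) * unit_reliability
    )


# Hypothetical time-scored unit test with reliability .70, doubled in length.
doubled = spearman_brown(0.70, 2)
```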

6. Weighting of time and error scores 
Sometimes the question of weighting time and error scores to get a
single composite score is raised. Again, in general, the best thing to do
is to have a criterion and to use the weights that best enable us to predict
the criterion. The multiple correlation approach is the best one for
the problem of weighting when an outside criterion is available.

Figure 1. Illustrating different weightings of time and errors: (a) Equal
weights for time and errors, (b) Errors receive twice as much weight as
time, (c) Time receives twice as much weight as errors.

Often, however, no outside criterion is available. Then the only
recourse is to fall back upon judgment. A detailed technical method
for securing and dealing with such judgments is given by Thurstone
(1931b) in his article “The Indifference Function.” If a sufficient
amount of time and number of judges are available, this method should
be used. However, a very crude approach can also be used that utilizes
a correlation scatter plot of time against errors. This scatter plot may
be a plot of actual cases or simply an imaginary one where the instructor
is asked to suppose that cases of various types occurred.

The instructor in the course, or a group of instructors, or some other
authority is then asked to judge “which is better” for various pairs of
students. In this way we can rapidly and crudely determine a family
of lines that will divide the scatter plot into appropriate zones of
increasing ability. Usually these zones can be approximated closely enough
by a series of parallel straight lines. This fact indicates that a linear
combination of time and errors is adequate. The relative weights of the
two factors are proportional to the slopes of the lines. For Figure 1a
time and errors are equally weighted, for Figure 1b errors have twice
the weight of time, and for Figure 1c time has double the weight of
errors.

A similar graphic system can be used for determining rapidly the
opinion of an expert judge in the field regarding the appropriate
weighting of any two subtests in a composite score.

7. Weighting with a criterion available 

When a definite criterion score is available, we should always use the 
multiple correlation to determine the relative weighting of time and 
errors, of rights and wrongs, or of rights, wrongs, and skips, or any other 
set of subscores that can be obtained from a test. 

If the test score is used to predict a definite criterion, the scoring
method should be based on multiple-correlation methods to secure
maximum prediction of the criterion. In principle it is possible to
determine a separate weight for each item in the test, and to do this in
such a way as to maximize the correlation of total test score with the
criterion. In practice, however, this procedure is not usually followed,
partly because of the very great amount of calculation involved
and partly because the individual item weights are likely to be very
unstable unless based on large numbers of cases. For example, Guttman
reports a study in which a sample was divided into two random halves;
the first half was scored on the basis of multiple-correlation weights
assigned to each item, with a resulting multiple correlation of .73.
When these same weights were used on the second sample the correlation
was .04. (See Horst, 1941, page 360.)

It is often, however, both feasible and desirable to determine the 
multiple correlation and corresponding weights for a few subscores. 
For example, instead of being weighted on the basis of average chance 
success, the wrongs can be weighted to secure maximum prediction of 
the criterion. This method was given by Thurstone (1919). If the
rights are to be weighted unity, the best formula is¹

(8)    $X_w = R + \frac{s_r (r_{yw} - r_{yr} r_{rw})}{s_w (r_{yr} - r_{yw} r_{rw})}\, W,$

where $s_r$ is the standard deviation of the rights score,
      $s_w$ is the standard deviation of the wrongs score,
      $r_{rw}$ is the correlation between “number right” and “number wrong,”
      $r_{yr}$ is the validity coefficient for number right, and
      $r_{yw}$ is the validity coefficient for number wrong.

When the weights of equation 8 are used, the correlation between
$X_w$ and Y will be the multiple correlation given by

(9)    $R_{y \cdot rw} = \sqrt{\frac{r_{yr}^2 + r_{yw}^2 - 2 r_{yr} r_{yw} r_{rw}}{1 - r_{rw}^2}}.$

The same type of weighting scheme can be used for any two variables
that are being used to predict a third. For example, if Y is the criterion,
W is the number of errors, and T is the time score, the best weighting
of errors in relation to time will also be given by formula 8 if T is substituted
for R. Formula 9 also gives the multiple correlation of this
weighted time and errors score with the criterion, if correlations involving
time are substituted for those involving number right in the formula.
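
The computation called for by formulas 8 and 9 can be sketched in Python. The sketch assumes the usual two-predictor least-squares results: the weight for wrongs is the ratio of the regression weight for W to that for R, and the multiple correlation is the standard two-predictor expression. The function names and the numerical values are hypothetical, chosen only for illustration.

```python
import math

def wrongs_weight(s_r, s_w, r_rw, r_yr, r_yw):
    """Weight applied to W when R is weighted unity (assumed form of formula 8)."""
    return (s_r * (r_yw - r_yr * r_rw)) / (s_w * (r_yr - r_yw * r_rw))

def multiple_correlation(r_rw, r_yr, r_yw):
    """Correlation of the weighted composite with the criterion (assumed form of formula 9)."""
    numerator = r_yr ** 2 + r_yw ** 2 - 2 * r_yr * r_yw * r_rw
    return math.sqrt(numerator / (1 - r_rw ** 2))

# Hypothetical statistics for one test and one criterion.
weight = wrongs_weight(s_r=10.0, s_w=5.0, r_rw=-0.5, r_yr=0.6, r_yw=-0.4)
big_r = multiple_correlation(r_rw=-0.5, r_yr=0.6, r_yw=-0.4)
print(round(weight, 3), round(big_r, 3))
```

With these illustrative values each wrong answer costs half a point, and the composite correlates slightly higher with the criterion than the rights score alone. For time and errors, the time statistics would be passed in place of the rights statistics.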


8. Scoring items that have no correct answer 

The items in tests of personality characteristics, attitudes, and interests
frequently do not have clear-cut “correct” and “incorrect” answers.
Then it is necessary to have a criterion that we wish to predict in order
to set up the scoring key for the test. The simplest scoring key is one
in which each alternative answer is to be scored either 1 or 0. If there
are only two alternatives, say A and B, we obtain the average criterion
score for those choosing the A alternative, and the average criterion score
for those choosing the B alternative, and assign the score 1 to the alternative
having the higher and zero to the alternative having the lower
criterion score. If there are many items, and we desire to eliminate
some of them, a measure of the significance of the difference, such as
$(M_a - M_b)/s_{a-b}$, may be obtained, and items with a low value for
this critical ratio may be discarded.

¹ Formulas 8 and 9 are derived in Chapter 20. See equations 56 and 58 of Chapter 20.

If there are several alternative responses for an item, it may still be
well to stick to a simple 1 or 0 scoring key. Then the procedure would 
be to compute the mean criterion score for each of the alternatives, to 
arrange them in order of magnitude, and to observe where the greatest 
difference occurred between successive means. The ones above this 
dividing point are scored 1, and those below are scored 0. A somewhat 
more elaborate procedure would be to use a measure of the significance 
of the difference for each of the possible cutting points, and to choose 
that which gave the largest critical ratio. 
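
The simpler of these two procedures (cutting at the largest gap between successive means) can be sketched in Python. The function name and the mean criterion scores are hypothetical; ties between equal gaps are resolved by taking the first, which is an arbitrary choice made here.

```python
def zero_one_key(mean_criterion):
    """Score 1 for alternatives above the largest gap in mean criterion score, else 0.

    mean_criterion maps each alternative to the mean criterion score of the
    persons who chose it. At least two alternatives are required.
    """
    ordered = sorted(mean_criterion, key=mean_criterion.get)  # lowest mean first
    gaps = [
        mean_criterion[ordered[i + 1]] - mean_criterion[ordered[i]]
        for i in range(len(ordered) - 1)
    ]
    cut = gaps.index(max(gaps))  # position of the greatest difference
    return {alt: (1 if i > cut else 0) for i, alt in enumerate(ordered)}

# Hypothetical mean criterion scores for a four-alternative item.
key = zero_one_key({"A": 52.0, "B": 47.5, "C": 60.0, "D": 48.0})
print(key)
```

Here the largest gap lies between alternatives A and C, so only C is scored 1.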

If a very large number of cases are available for standardization, so 
that we can have confidence in the stability of the results, it may be 
reasonable to consider a more complex scoring key that would assign a 
different weight to each possible alternative. A procedure is given by 
Guttman (see Horst, 1941, page 341). In order to maximize the correla¬ 
tion between the score and the criterion, it is necessary to obtain the 
mean criterion score for those selecting each alternative, and then to 
assign weights such that the differences in weights are proportional to 
the differences in mean criterion scores. A simple method of doing this 
is to assign the value 0 to the alternative with the lowest mean criterion 
score, and then to subtract this mean from each of the others, and assign 
a rounded fraction of this difference in means as the weight for the alter¬ 
native. For example, in weighting it is probably desirable to limit the 
weights to the integers from 0 to 5, or from 0 to 10. Wilks (1938) has 
shown that under certain fairly general sets of assumptions, the correla¬ 
tion between one linear composite and another composite using different 
weights will differ from unity by about 1/n, where n is the number of 
different elements entering into each weighted composite (see equation 
47, Chapter 20). That is, elaborate weighting systems with fractional 
or negative weights probably should be avoided. The use of 0 and 1 
or of 0, 1, and 2 is enough for most situations. 
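
As a minimal sketch of the simple weighting method just described, the following Python function assigns 0 to the alternative with the lowest mean criterion score and a rounded fraction of each remaining difference to the others. The choice of fraction (scaling so that the highest alternative receives the top weight) is an assumption made here, as are the function name and the data.

```python
def integer_weights(mean_criterion, max_weight=5):
    """Integer weights from 0 to max_weight, proportional to differences in means."""
    lowest = min(mean_criterion.values())
    span = max(mean_criterion.values()) - lowest
    if span == 0:
        # All alternatives have the same mean; no differential weighting is possible.
        return {alt: 0 for alt in mean_criterion}
    return {
        alt: round(max_weight * (mean - lowest) / span)
        for alt, mean in mean_criterion.items()
    }

# Hypothetical mean criterion scores for a four-alternative item.
weights = integer_weights({"A": 52.0, "B": 47.5, "C": 60.0, "D": 48.0})
print(weights)
```

Rounding collapses the small difference between B and D, which is in keeping with the advice that coarse integer weights are enough for most situations.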

It is also possible to score a five-alternative multiple choice item by 
assigning a different weight to each alternative. The usual procedure 
is to select one of the alternatives as correct, and to score any one of the 
other four zero. Sometimes item analysis data indicate clearly that the 
persons selecting one wrong answer are much better or poorer than those 
selecting another wrong answer. If it is possible to standardize a test 
on five or ten thousand cases, it might be worth while to consider the 
possibility of differential weighting for each alternative. The most 
common plan at present is to use such detailed item analysis data only 
for the purpose of discarding the poorer distractors, since it is felt that 
scoring items either 1 or 0 is highly desirable.

9. Scoring of rank-order items

In testing for certain types of knowledge it is frequently convenient 
to require the student to arrange the alternative answers in order with 
respect to some characteristic. If we wish to test for knowledge of 
chronology without the use of dates, a series of three to ten events can 
be listed and the student required to number them in order from the 
earliest to the latest. In testing for the student’s appreciation of a given
philosophical or political viewpoint, it is possible to present three to 
five arguments for a given line of action, and require the student to 
mark 1 for the argument most likely to be used by a socialist (for 
example), 2 for the argument next most likely to be used, and so on to 
5 for the argument that is least likely to be used by a person with 
such a viewpoint. In order to test for very fine discrimination in 
any field, it is possible to ask a question, and then present the student 
with three to five answers. The task is for the student to grade these 
answers, just as if he were the instructor in the course, by ranking them 
in order from best to worst. In all cases like the foregoing, we have a 
problem of how to grade rank-order items. 

To simply prepare a key giving the correct order and then give one 
point for each agreement between the key and the student’s ranking is 
clearly a poor method. For example, if the correct order is 1, 2, 3, 4, 
the answer 2, 1, 4, 3 shows zero agreement with the key, and so does the 
answer 4, 3, 2, 1. Yet the first is clearly better than the second. One 
easy method for scoring such items is to insist that every item be correct 
in order for credit to be given; any error regardless of how many is given 
zero credit. Usually the subject matter expert deems such a method 
unsatisfactory. The person who makes only one inversion (hence has 
two disagreements with the key) is clearly better than the person who 
has things mixed up all along the line. A better method is to secure the 
differences between the rank order given by the subject and the rank 
order given by the key. For an elaborate scoring procedure we should 
square these differences, sum them, and compute the rank correlation 
by the formula 

(10)    $\rho = 1 - \frac{6 \sum d^2}{n^3 - n},$

where $\sum d^2$ is the sum of the squares of rank differences and n is the
number of items ranked. However, the computation of a correlation 
coefficient for each such item on each paper introduces both considerable 
labor and considerable probability of error. 
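
For illustration, the rank correlation of formula 10, taken here as 1 − 6Σd²/(n³ − n), can be computed with a few lines of Python. The function name is hypothetical; the two answers are those discussed above for the key 1, 2, 3, 4.

```python
def rank_correlation(key, answer):
    """Rank correlation between the keyed order and the student's order."""
    n = len(key)
    sum_d2 = sum((k - a) ** 2 for k, a in zip(key, answer))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

key = [1, 2, 3, 4]
two_inversions = rank_correlation(key, [2, 1, 4, 3])    # neighbours interchanged
full_reversal = rank_correlation(key, [4, 3, 2, 1])     # order completely reversed
print(round(two_inversions, 2), round(full_reversal, 2))
```

The two answers, which both show zero agreements with the key, are separated clearly: the first receives a correlation of .60 and the second a correlation of −1.00.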

A simple and satisfactory method for scoring such items is to use the
sum of the absolute differences. If the rest of the examination is scored
in number of errors, this sum can be added directly in with the errors.
If the rest of the examination is scored in terms of number correct, this
sum can be subtracted from some constant to give zero disagreement
the highest score and great disagreement the lowest score. The formula
would be

(11)    $\text{Score} = C - \sum |d|,$

where $\sum |d|$ is the sum of differences ignoring sign, and
      C is a constant larger than the greatest $\sum |d|$ we are likely
        to find. (If in scoring we find a few negative differences,
        that is, papers for which $\sum |d|$ exceeds C, these may be
        counted as zero.)
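
A short Python sketch of formula 11 follows. The function name is hypothetical, and the default value of C is an assumption made here for convenience: the largest sum of absolute rank differences that can occur with n ranks, which is n²/2 rounded down.

```python
def rank_item_score(key, answer, c=None):
    """Score = C minus the sum of absolute rank differences, never below zero."""
    total = sum(abs(k - a) for k, a in zip(key, answer))
    if c is None:
        c = len(key) ** 2 // 2  # assumed default: greatest possible sum of |d|
    return max(c - total, 0)    # negative results are counted as zero

key = [1, 2, 3, 4]
scores = [
    rank_item_score(key, [1, 2, 3, 4]),  # perfect agreement
    rank_item_score(key, [2, 1, 4, 3]),  # neighbours interchanged
    rank_item_score(key, [4, 3, 2, 1]),  # complete reversal
]
print(scores)
```

Only additions and one subtraction are needed for each item, and the order of merit of the three answers is the same as that given by the rank correlation.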

In ranking three items a still simpler method is available. Ask the 
student to mark + for the best of the three, 0 for the poorest, and to 
leave the middle one blank. For all students who have marked the item 
with one +, one 0, and one blank, the papers can be scored by matching 
the key against the student’s item and scoring only the alternative 
keyed + and the alternative keyed 0. The student gets one point for 
each of these, which means that he receives two points for perfect 
agreement with the key, one point if either the best or the poorest 
alternative has been confused with the middle one, and zero for all more 
serious confusions. In order to secure more different scores, it is possible 
to assign two points for agreement with the key on + and two points for 
agreement on 0; one point for leaving either the + or the zero alternative 
blank; and no credit for marking with the wrong symbol. Such a scoring 
system gives four points for perfect agreement; three points if the only 
error is a confusion of either the best or worst alternative with the middle 
one; zero for a complete reversal of the correct order; and one point 
for only one inversion from this worst order. It is not possible to get 
two points with this system if the student follows the directions. Such 
a scoring plan makes possible the rapid scoring of rank-order items, and 
has given scores correlating highly with total score in many instances. 
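
Both versions of the three-item scheme can be sketched in Python. The representation of a paper as a mapping from alternative to "+", "0", or an empty string for blank is an assumption made here, as are the names used.

```python
from itertools import permutations

KEY = {"a": "+", "b": "", "c": "0"}  # hypothetical key: a is best, c is poorest

def score_three(key, marks, detailed=False):
    """Score a three-item ranking; only the alternatives keyed + and 0 are scored."""
    score = 0
    for alternative, keyed in key.items():
        if keyed == "":
            continue  # the middle alternative is not scored
        mark = marks[alternative]
        if mark == keyed:
            score += 2 if detailed else 1
        elif detailed and mark == "":
            score += 1  # keyed alternative left blank: partial credit
    return score

# Every way of using one +, one blank, and one 0 on the three alternatives.
possible = sorted({
    score_three(KEY, dict(zip("abc", marks)), detailed=True)
    for marks in permutations(["+", "", "0"])
})
print(possible)
```

Running through all six admissible papers confirms that the more detailed system yields scores of 0, 1, 3, and 4, and never 2.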

It should be noted that with rank-order items theorems involving K 
(number of items) cannot be applied except for parallel tests that contain 
matched rank-order type items. 

10. Summary 

For most tests in the aptitude and achievement field, it has been 
found that the number of items answered correctly, or the number of 
errors, is an eminently satisfactory score. The added labor of using a 
weighted composite of errors and correct responses is worth while only 
in certain special cases. 

The notation used in the special scoring formulas is:

R (rights), the number of items marked correctly. 

W (wrongs), the number of items marked incorrectly. 

S (skips), the number of items that have not been marked but are 
followed by marked items. 

U (unattempted), the number of consecutive items at the end of the 
test that are not marked. 

B (blank), the number of unmarked items (S + U). 

A (alternatives), the number of possible answers listed for each question.
It is assumed here that the same number of alternatives
is presented for each question.

If a test that has been designed primarily as a power test turns out 
to have a large number of unattempted ( U ) items on some papers, and 
a small number on others, it may be that some students are using the 
last two minutes of testing time to mark answers without reading items 
in order to get the benefit of a chance score. In order to score the 
examination fairly, in spite of considerable variation in guessing from 
one person to another, one of the following weighted composites should
be used:

(1)    $X_w = R - \frac{W}{A - 1}$

or

(2)    $X_b = R + \frac{B}{A}$

or

(6)    $X_u = R + \frac{U}{A}.$

Equations 1 and 2 correlate perfectly but do not give identical scores.
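
That the first two composites correlate perfectly can be checked directly. The short derivation below assumes that composite 2 has the form R + B/A and that every item is either right, wrong, or blank, so that K = R + W + B, where K is the number of items in the test.

```latex
\begin{aligned}
R - \frac{W}{A-1} &= R - \frac{K - R - B}{A-1} \\
                  &= \frac{AR + B - K}{A-1} \\
                  &= \frac{A}{A-1}\left(R + \frac{B}{A}\right) - \frac{K}{A-1}.
\end{aligned}
```

Since A and K are the same for every paper, each composite is a linear function of the other with the positive slope A/(A − 1). The correlation between them is therefore unity, although the numerical scores differ.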

If a test is designed primarily as a speed test, and we find that there
has been a considerable number of items skipped (S-score) or answered
incorrectly (W-score), it may be desirable to introduce a small penalty
for skips and a larger penalty for errors. The formula suggested was

        $\text{Score} = R - \frac{W}{C} - \frac{S}{D},$

where C and D are arbitrary constants, C < D, in order to penalize
errors (W) more than skips (S). It is perhaps appropriate to make D
slightly larger than the number of alternatives A, and C slightly smaller
than $A - 1$.

If time and error scores are to be weighted in determining a composite
score, and if no criterion is available, judgments must be relied on to
determine relative weights. Such problems may be handled by the
“indifference function” technique, or by a rapid and crude graphic
method as illustrated in Figure 1. If a criterion is available, multiple
correlation methods can be used as indicated by formulas 8 and 9.

If we are dealing with items that do not have a clear-cut correct 
answer, such as items in a personality questionnaire, it is necessary to 
have a criterion in order to set up the scoring key. In order to maximize 
the correlation between score and criterion, weights should be assigned 
the different alternatives so that the differences in weights will be proportional
to the differences in mean criterion score for the persons choosing
each alternative. See the procedure given by Guttman in Horst
(1941, page 341).

Rank-order items may be quickly scored by an approximation to a
correlation coefficient given by

(11)    $\text{Score} = C - \sum |d|,$

where $\sum |d|$ is the sum of absolute differences in rank between the correct
        order and the order assigned by the student, and
      C is any arbitrary constant, larger than the greatest possible
        $\sum |d|$.

A still simpler system suitable only for the ranking of three items is also 
described in section 9. 


Problems 

1. Derive the formula for the correlation between number correct and number 
incorrect for an objective test, assuming that there are no omissions. 

2. Derive the formula for the correlation between number correct and number
incorrect for an objective test. Assume that there are omissions and express the
results in terms of the variances of number right and number omitted and the correlation
between these two variables.

3. Derive the formula for correction for chance success in a test each item of which
has one correct and six incorrect choices. State clearly each assumption used in the
derivation.

4. Comment briefly on the material in Moore’s 1940 article.


Methods of Standardizing and Equating Scores


1. Introduction 

After having decided on an appropriate scoring system for the test, 
as indicated in the preceding chapter, we must make some decisions 
with reference to the distribution of gross scores obtained. In a schol¬ 
arship examination some are awarded the scholarships, and the rest 
are not. For Civil Service Examinations, certain persons are placed on 
the eligible list, while others are considered ineligible for certain types 
of jobs. In the examinations given by the College Entrance Examina¬ 
tion Board the scores are converted to a certain standard form and re¬ 
ported to college admissions officers, who use these scores along with 
other information in deciding which applicants to accept and which to 
reject. In a college achievement test, given by an instructor in his 
course, it is necessary to decide which students failed the examination, 
which ones made an A grade, which ones made a B grade, etc. In general,
we may say that, in using the scores from an examination, it is
necessary to determine one or more “critical scores” or to report the
results in some standardized form to persons who will make such decisions,
and possibly study the relationship of these scores to other variables.
We shall now consider various factors in, and the different
methods available for, determining critical scores and for standardizing
test scores.

2. Assessing the gross score distribution

For every test, regardless of the standardizing system to be used, it 
is desirable to make a frequency distribution of the gross, or raw, scores 
and to inspect this distribution carefully. If the test is an achievement 
test, it is desirable for a test technician to discuss the various points 
with a subject matter expert, since either one alone might overlook 
important points. 

The first points to note about any test are the number of items (K)
and the number of alternative answers (A) presented for each item.
From this information we can determine three quantities very important
in evaluating any distribution of scores. These quantities are:

1. The perfect score, which is usually equal to K, the number of
items in the test.

2. The average chance score $M_c$, which is usually equal to K/A.

3. The variance of a distribution of chance scores, which is $Kp(1 - p)$,
where K is the number of items and p is the probability of answering
an item correctly. If p is taken as 1/A, the variance of a distribution
of chance scores becomes $K(A - 1)/A^2$, and the standard
deviation of the distribution of chance scores is

(1)    $s_c = \frac{\sqrt{K(A - 1)}}{A}.$



It should be noted that these considerations regarding the magnitude
of a chance score apply only to power tests or to tests that are primarily
power tests. In speed tests it is necessary to be certain that the number
of errors made is negligible; methods for determining this have been
discussed in Chapters 17 and 18. These three quantities (K, K/A,
and $s_c$) will show the possible meaningful score range for the test. A
score that is within one or two standard deviations ($s_c$) of a chance
score should not be interpreted as signifying any knowledge of the
subject matter of the examination. For example, if we take the standard
that the score must exceed the average chance score by more than $2s_c$,
then, for a 25-item test of 5-choice items, we should have a perfect score
of 25, an average chance score of 5, and, since $s_c$ is 2, a reasonable upper
limit for chance scores may be set at 5 + 2 × 2 = 9. That is, this
examination has only 16 possible scores (10 to 25 inclusive) that could
indicate varying degrees of achievement in the field. On the same
basis, we see that a 10-item true-false quiz has only two possible scores
(9 and 10) that could indicate varying degrees of achievement in the
field. As a first check on any examination, it is well to be certain that
the lowest score that is taken as indicative of knowledge is well above
the average chance score, and to be certain that the number of possible
scores between this lowest score and the highest score is considerably
greater than the number of subgroups we wish to determine from the
test.
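
The first check described above is easily mechanized. The Python sketch below computes the three quantities and the number of scores lying above the chance range; the function name and the form of the result are assumptions made for illustration, and the margin of two chance standard deviations is the standard used in the example.

```python
import math

def chance_score_summary(k, a, margin=2.0):
    """Perfect score, chance mean, chance s.d., and count of meaningful scores."""
    chance_mean = k / a
    chance_sd = math.sqrt(k * (a - 1)) / a
    upper_chance = chance_mean + margin * chance_sd
    lowest_meaningful = math.floor(upper_chance) + 1  # score must exceed the limit
    return {
        "perfect": k,
        "chance_mean": chance_mean,
        "chance_sd": chance_sd,
        "lowest_meaningful": lowest_meaningful,
        "meaningful_scores": max(k - lowest_meaningful + 1, 0),
    }

five_choice = chance_score_summary(25, 5)   # 25 five-choice items
true_false = chance_score_summary(10, 2)    # 10 true-false items
print(five_choice["meaningful_scores"], true_false["meaningful_scores"])
```

The two calls reproduce the figures in the text: 16 usable scores for the 25-item five-choice test and only 2 for the 10-item true-false quiz.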

Having obtained the lowest non-chance score and the perfect score 
from knowing only the number of items, and the number of choices per 
item, we next make a frequency distribution of scores and find the
mean and standard deviation of this distribution. It is also necessary 
to use some method of determining the reliability of the test and the 
error of measurement in order to compare the error of measurement 
with the score range between the upper and lower bound of any given 
subgroup. For example, if an achievement examination is being used 
to divide students into A’s, B’s, C’s, and D’s, it would seem desirable to 
have the score range of about three or four times the error of measure¬ 
ment from the lowest B to the highest B. We can readily see that, if 
this score range is equal only to the error of measurement, then through 
examination error alone quite a few students who should receive A’s 
will receive C’s, and vice versa. Errors in classification of students at 
the borderline between A and B, between B and C, etc., cannot be 
avoided under any circumstances. It is possible, however, by making 
the distance between the upper and lower bound of any one subclass 
large, in comparison with the error of measurement, to be relatively 
certain that errors of classification of two or more groups will be avoided. 

In general the significant or important distances on the scale, such 
as the distances between different critical scores or the differences be¬ 
tween successive school grades or successive years, should be very large 
in comparison with the error of measurement of the test. A difference 
as large as this is necessary in order to insure that important decisions 
are not made on the basis of accidental fluctuations. 

It should be noted that nothing has been said about “per cent of a 
perfect score” as one of the criteria for judging the distribution of raw 
scores. Unless we have very thorough procedures for pretesting items 
so that item difficulty and test reliability are equated from one examina¬ 
tion to another, the amount of knowledge indicated by a given per cent 
of perfect score will vary tremendously from one examination to another. 
If an examination is composed of items that are answered correctly on 
the average by 80 per cent of the students, the average student will make 
80 per cent of a perfect score, and the upper half of the class will be 
grouped in the narrow score range between 80 per cent of perfect and 
perfect. Compare such an examination with one which attempts to
discriminate between the good student and the very superior student.
This latter examination would probably be composed of items answered
correctly on the average by 50 per cent of the students so that the
average student would make 50 per cent of a perfect score, and the wide
score range from 50 per cent correct to perfect would be available for
distinguishing between various degrees of ability in the upper half of
the class. Scoring these two examinations on the basis of per cent of a
perfect score would not give satisfactory results. Each distribution
must be inspected to determine where the average score and scores one,
two, and three standard deviations above and below average lie with 
respect to the perfect score and the lowest non-chance score. The 
judges should then select the various critical score points, taking care 
to make the distance between these points reasonably greater than the 
error of measurement of the test. 

The effect on standards of requiring “successive hurdles” versus permitting
multiple attempts to pass the examination must be considered
in setting any critical score. The term successive hurdles is used to
designate a procedure whereby the successful candidate must have
passed each of several tests; failure on any one disqualifies the person.
If such an administrative procedure is being followed, it is essential to
pass many more at any given step than are desired to pass the total
procedure. On the other hand, the effect of permitting multiple attempts
is to lower standards, particularly if the examination is unreliable.
In effect this is the opposite of the successive hurdles procedure
in which a single failure disqualifies the candidate. If many trials are
allowed, the candidate is usually passed if he succeeds in any one attempt.
In order not to be accepted the candidate must fail in every attempt.
It is clear then that, if multiple trials are permitted, it is well to err on
the side of fixing the lowest passing score too high; whereas, if a successive
hurdles procedure is followed, it is well to err on the side of fixing
the lowest passing score too low.
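
The size of these two effects can be illustrated with a rough Python sketch. It rests on a simplifying assumption not made in the text, namely that a candidate's results on the several hurdles, or on the several attempts, are independent and equally likely to be passes. Function names and figures are hypothetical.

```python
def pass_all_hurdles(p_each, hurdles):
    """Chance of passing every one of several hurdles (independence assumed)."""
    return p_each ** hurdles

def pass_any_attempt(p_single, attempts):
    """Chance of passing at least one of several attempts (independence assumed)."""
    return 1 - (1 - p_single) ** attempts

def per_hurdle_rate(overall, hurdles):
    """Pass rate needed at each hurdle to reach a desired overall pass rate."""
    return overall ** (1 / hurdles)

hurdle_result = pass_all_hurdles(0.5, 3)     # three hurdles, even chance on each
attempt_result = pass_any_attempt(0.5, 3)    # three attempts, even chance on each
needed = per_hurdle_rate(0.125, 3)           # to pass one-eighth over three hurdles
print(hurdle_result, attempt_result, round(needed, 3))
```

Under these assumptions an even chance at each step passes only one candidate in eight over three hurdles, but seven in eight over three attempts, which is the reason for erring in opposite directions in the two cases.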

3. Standardizing by expert judgment, using an arbitrary scale 

In some cases the major interest of the subject matter expert lies in
one or two critical scores; yet it is desirable, or required by some regulation,
that many different score values be reported. For example, in
some colleges it is conventional to grade on a scale from 100 (representing
a perfect score) to 65 or 70 (representing the failure line). In
Navy schools it is conventional to grade on a scale from 4.0 (representing
a perfect score) to 1.0, or possibly lower, for the poorest possible
performance. On this scale 2.5 is a critical score (the lowest passing
grade). In Civil Service ratings 70 is defined by regulations as the mark
to be assigned the lowest acceptable performance, and 100 is the highest
mark to be assigned.

In making any transformation from a given raw score scale to some
conventional scale with critical limiting values, the simplest and best
procedure is to determine the limiting values carefully and then to make
a linear interpolation between these values. Such a procedure is described
for use in Navy schools by Stuit (1947), pages 485-487. A similar
procedure for converting raw scores into Civil Service ratings is
described by Adkins et al. (1947), pages 194-202. The simplest method
is a graphic one.

1. Prepare a graph in which the various possible raw score values are 
indicated on one axis, and the various possible values of the desired 
arbitrary scale are indicated on the other axis. 

2. Prepare a frequency distribution of the raw score values with the 
various key points such as the mean, standard deviation, standard error 
of measurement, perfect score, average chance score, and average 
chance score plus one or two times the standard deviation of such scores 
(see equation 1) indicated. 

3. Determine the raw score corresponding to some critical level, such 
as the lowest passing mark. In determining this point all relevant fac¬ 
tors must be considered, such as the probable difficulty level of the 
examination, the standards it is necessary to maintain, and the number 
and per cent of candidates above or below this critical point. Paren¬ 
thetically, it may be remarked that sometimes a committee will feel 
that it is desirable to look for gaps in the score distribution, and to set 
the lowest passing mark just above such a gap. As pointed out by 
Adkins et al. (1947), pages 197-198, such gaps are purely accidental and 
should be ignored in favor of more rational considerations in determin¬ 
ing the critical points. 

4. Determine the raw score corresponding to another fixed point, 
such as the highest score to be assigned. The top score of 100, 99, or 
4.0 need not be assigned to a raw score that corresponds to a perfect 
paper. If the examination is very difficult, it might be desirable to 
take a raw score considerably lower than the perfect one to correspond 
to the highest assigned score. At the other extreme, if the examination 
is very easy so that, for instance, 5 or 10 per cent of the persons made a 
perfect raw score, it might be desirable to assign a score below the 
highest allowable (such as 80 or 3.5) to the perfect raw score. 

5. Plot the points determined in steps 3 and 4 on the graph, and con¬ 
nect them with a straight line. From this line it is possible to read off 
the transformed score corresponding to each raw score. 

By repeating steps analogous to 3 and 4 for other critical points it is 
possible to set up several different linear transformations in different 
parts of the scale, should that appear desirable. 
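
The straight line of step 5 can equally well be computed. The Python sketch below performs the linear interpolation through the two points fixed in steps 3 and 4; the function name and the raw score values are hypothetical.

```python
def linear_scale(raw, raw_low, scaled_low, raw_high, scaled_high):
    """Transformed score on the straight line through two fixed points."""
    slope = (scaled_high - scaled_low) / (raw_high - raw_low)
    return scaled_low + slope * (raw - raw_low)

# Hypothetical decisions: a raw score of 40 is the lowest passing paper and
# a raw score of 90 is to receive the highest mark.
civil_service = linear_scale(65, raw_low=40, scaled_low=70, raw_high=90, scaled_high=100)
navy = linear_scale(65, raw_low=40, scaled_low=2.5, raw_high=90, scaled_high=4.0)
print(civil_service, navy)
```

For several linear transformations in different parts of the scale, the same function would be applied separately between each adjacent pair of critical points.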

4. Transformations to indicate the individual’s standing in his group: general considerations

In many testing situations it is not possible or desirable or necessary
to make immediate decisions for action on the basis of the gross score
distribution. In such situations it is conventional to transform the
gross scores into some uniform set of numbers that indicates the relative
standing of the individual in his group. For example, transformed
scores on the tests of the College Entrance Examination Board are reported
to the designated colleges. The admissions officer of each college
determines which scores will be regarded as critical for purposes of admission
to his institution. In the aptitude testing programs of the
Army, Navy, and Air Forces, during the second World War, the tests
were given and the transformed scores made a part of the man's permanent
record. As experience accumulated regarding the performance
of men in different schools and jobs or as the relative needs in the different
schools changed, the critical score requirements could be specified
and altered.

Four different gross score transformations that indicate the relative
standing of the individual in his group will be considered: (1) linear
transformations, including (a) standard score and (b) linear derived
scores; (2) non-linear transformations, including (a) percentile score
and (b) normalized score.

In using standard, linear derived, percentile, or normalized scores, we 
should bear in mind that such scores indicate only the relationship of 
the individual to a given group. They indicate nothing about the gen¬ 
eral level of knowledge or attainment of the group or its members. For 
example, a set of percentile or standard scores on a test in American 
history would not indicate whether the students had a comprehensive 
grasp of the major items in American history or only a very meager 
knowledge. Such an assessment must be based on the judgment of 
subject matter experts, and can never be determined by clever quanti¬ 
tative scoring devices. In setting up the test, the judgment of the sub¬ 
ject matter expert is used to include a good sampling of items from the 
field, and as indicated in section 2 of this chapter the subject matter 
expert must assess the gross score distribution, with the help of a test 
technician, in order to determine critical scores between the chance 
score and the perfect score. From this point of view the most satisfac¬ 
tory testing programs are those closely related to training programs so 
that the subject matter expert may, for instance, judge: “The perform¬ 
ance of these students is unsatisfactory; I will step up the quality and 
quantity of work demanded in the training program so that the next class 
will make a higher average gross score than this one has made." By 
using the same or parallel tests, it is then possible to see whether or not 
the altered training program has produced the desired result of a higher 
test score. For an illustration of such a use of testing in conjunction 
with training programs, see Stuit (1947), pages 303-313. The blind use 
of group norms, such as the standard, percentile, or normalized scores, 
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without any assessment of the absolute level of achievement in terms of 
judgments of subject matter experts, may serve to conceal marked in¬ 
adequacy of training standards. 


5. Linear transformations: standard score

The basic linear transformation of gross scores is known as the
“standard score.” The individual's score is expressed as a deviation
from the mean of the distribution (that is, the mean is taken as the
origin or zero point). The score unit is taken as the standard deviation
of the gross score distribution. Standard scores will thus have a mean
of zero and a standard deviation of unity. Using $z_i$ to designate the
standard score of the ith individual, we may write

(2)    $z_i = \frac{X_i - \mu}{\sigma},$

where $X_i$ is the gross score of the ith individual,
      $\mu$ is the population mean, and
      $\sigma$ is the standard deviation of the population.

Since the mean and standard deviation of the population are usually 
unknown, it is conventional to have a large sample and to use the mean 
and standard deviation of this sample in computing standard scores. 
The formula may be written 

(3)    $z_i = \dfrac{X_i - M_X}{s_X}$, 

where $M_X$ and $s_X$ are the mean and standard deviation of the 
distribution of gross scores ($X$). The numerator of equation 3 is 
frequently designated by the lower case $x$ ($x_i = X_i - M_X$), and is 
referred to as a deviation score. The term “deviation score” usually 
refers to scores expressed in terms of deviations from the mean of the 
distribution. Since the population mean and standard deviation, as 
indicated in equation 2, are usually unknown, it is usually necessary to 
use equation 3 instead of equation 2 in calculating standard scores. 
However, the general problem of standardizing several different forms of 
a test or of using standard scores when several different groups are 
involved becomes clearer if the problem is considered in terms of 
equation 2. This equation indicates that the problem is to define clearly 
the population in terms of which the standard scores are to be computed, 
and then to use maximum likelihood estimates of the gross score mean and 
standard deviation of this population. 

From equations 2 or 3, we see that a standard score is a score in 
which the mean of the distribution is zero and its standard deviation is 
unity. All standard scores above the mean will be positive, and all 
below the mean will be negative. A person whose raw score is at the 
mean of the distribution will have a standard score of zero, since for 
that person 

$\dfrac{M_X - M_X}{s_X} = 0$. 

A person whose score is one standard deviation above the mean will 
have a standard score of 1.00, since $X_i - M_X = s_X$; hence equation 3 
equals 1.00. 

In order to use equation 3 for computing the standard score equivalent 
to each gross score, it is convenient to rewrite it in the form 

(4)    $z_i = \left(\dfrac{1}{s_X}\right) X_i - \dfrac{M_X}{s_X}$. 

In computing z-scores when $X_i > M_X$, enter $-M_X/s_X$ in the computing 
machine, clear the keyboard, put the quantity $(1/s_X)$ in the keyboard, 
and add it in $X_b$ times, where $X_b$ is the gross score value just above 
the mean score. Record the z-value corresponding to this X-score, and then 
add in $(1/s_X)$ once more for the next higher score, and so on until the 
z-value for the highest attained or the highest possible X-score has 
been found. When $X_i < M_X$, the procedure is similar, except that all 
the signs of the quantities must be reversed. Enter $+(M_X/s_X)$ in the 
machine, put in the quantity $(1/s_X)$, and subtract it once to give the 
z-score corresponding to a gross score of 1, twice to give the value 
corresponding to a gross score of 2, and so on until the value $X_a$ is 
reached, where $X_a$ is the gross score value just below the mean. All the 
z-scores corresponding to X-scores below the mean must be given negative 
signs. 

TABLE 1 

Frequency Distribution 

   Score      Frequency (f)      y       fy      fy^2 
  120-129           2           -5      -10       50 
  130-139           3           -4      -12       48 
  140-149          12           -3      -36      108 
  150-159          23           -2      -46       92 
  160-169          37           -1      -37       37 
  170-179          51            0        0        0 
  180-189          39           +1      +39       39 
  190-199          21           +2      +42       84 
  200-209           9           +3      +27       81 
  210-219           2           +4       +8       32 
  220-229           1           +5       +5       25 
  Column sums     200                   -20      596 
  Sums/N                               -0.1     2.98 

  Here y is the deviation from the assumed mean in class-interval units. 

  Class interval (CI) = 10 
  Correction term = (CI)(-0.1) = -1.0 
  Assumed mean 174.5 plus correction term -1.0 equals gross score mean 173.5 
  $\sum f y^2 / N - (\sum f y / N)^2 = 2.98 - 0.01 = 2.97$ 
  Gross score variance = 2.97(CI)^2 = 297.0 
  Gross score standard deviation = $\sqrt{297} = 17.234$ 

This computing procedure is illustrated with the frequency distribution 
of 200 cases shown in Table 1. This frequency distribution has a mean of 
173.5, a variance of 297.0, and a standard deviation of 17.234. 
Substituting these values in equation 3 gives 

$z_i = \dfrac{X_i - 173.5}{17.234}$. 

The computation equation 4 thus becomes 

$z_i = \dfrac{1}{17.234} X_i - \dfrac{173.5}{17.234}$ 

or 

$z_i = 0.058025X_i - 10.067309$. 


This equation is used directly in the computing machine for computing 
the standard score equivalent of all gross scores above the mean. Table 2 
illustrates a worksheet used to compute a standard score equivalent for 
the midpoint of each class interval.¹ The entries below the horizontal line 
in Table 2, where $X_i$ takes in succession the values from 174.5 to 224.5, 
correspond to scores higher than the mean of the distribution. Since 
the class interval in Table 1 is 10, the coefficient in the computing 
equation is multiplied by 10X at each step, instead of by X. For the 
negative entries above the horizontal line in Table 2 the equation used in 
the computing machine is 

$-z_i = 10.067309 - 0.058025X_i$, 

where $X_i$ takes in succession the values from 124.5 to 164.5 shown in 
the column labeled X in Table 2. Column z gives the standard scores 
to six decimal places. However, standard scores of psychological tests 
should at most be given to two decimal places, as shown in column z'. 
It may be noted that, even with a test reliability as high as .99, the 
error of measurement is 

$s_X\sqrt{1 - .99} = s_X\sqrt{.01} = .1\,s_X$, 

so that the error of measurement is greater than one-tenth the standard 
deviation for practically all tests. 

¹ For linear derived scores to be discussed in the next section, a convenient 
worksheet will be shown that gives a derived score equivalent for each 
different gross score. This procedure, illustrated in Table 3, may be 
adapted to z-scores if a standard score equivalent is desired for each 
gross score. 
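
The arithmetic of the error of measurement can be sketched in a few lines of Python. This is only an illustration, not part of the original procedure: the function name `standard_error_of_measurement` is introduced here for convenience, and the standard deviation of 17.234 is borrowed from Table 1 as an example value.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Error of measurement s_X * sqrt(1 - r) for a test with SD `sd`."""
    return sd * math.sqrt(1.0 - reliability)

# Even with a reliability of .99 the error of measurement is about
# one-tenth of the standard deviation (example SD taken from Table 1).
sem = standard_error_of_measurement(17.234, 0.99)
print(round(sem, 3))            # error in gross score units
print(round(sem / 17.234, 3))   # error as a fraction of the SD
```

With lower reliabilities the fraction only grows, which is why two decimal places are already more than the scores can support.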

TABLE 2 

Computing Form for Standard Scores 

     X           z          z' 
   124.5     -2.843196    -2.84 
   134.5     -2.262946    -2.26 
   144.5     -1.682696    -1.68      $-z_i = 10.067309 - 0.058025X_i$ 
   154.5     -1.102446    -1.10 
   164.5     -0.522196    -0.52 
   ------------------------------ 
   174.5     +0.058054    +0.06 
   184.5     +0.638304    +0.64 
   194.5     +1.218554    +1.22 
   204.5     +1.798804    +1.80      $z_i = 0.058025X_i - 10.067309$ 
   214.5     +2.379054    +2.38 
   224.5     +2.959304    +2.96 


In order to check the entries in Table 2, the differences between 
adjacent entries should be computed. In this table these differences are 
each equal to 0.58025, which is ten times the multiplying coefficient in 
the computing equation. It is also desirable to recompute the entries for 
about three selected points, one near the middle and one near each end 
of the scale. Gross errors may also be detected by computing the gross 
score equivalents for -3, -2, -1, 0, +1, +2, and +3 standard 
deviations from the mean to see that these scores fall in the proper 
intervals. 
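
The worksheet and its checks can be reproduced with a short Python sketch. It is offered as an illustration under stated assumptions, not as the desk-calculator routine itself: the names `TABLE_1`, `grouped_mean_sd`, `z_scores`, `step`, and `check_points` are introduced here, the class midpoints stand in for the gross scores, and the standard deviation is computed with divisor N and kept unrounded, so the sixth decimal can differ slightly from Table 2.

```python
import math

# (class midpoint, frequency) pairs transcribed from Table 1.
TABLE_1 = [
    (124.5, 2), (134.5, 3), (144.5, 12), (154.5, 23), (164.5, 37),
    (174.5, 51), (184.5, 39), (194.5, 21), (204.5, 9), (214.5, 2),
    (224.5, 1),
]

def grouped_mean_sd(table):
    """Mean and standard deviation (divisor N) of a grouped distribution."""
    n = sum(f for _, f in table)
    mean = sum(x * f for x, f in table) / n
    variance = sum(f * (x - mean) ** 2 for x, f in table) / n
    return mean, math.sqrt(variance)

mean_x, sd_x = grouped_mean_sd(TABLE_1)

# Equation 3 applied to each class midpoint, rounded to two decimals.
z_scores = [round((x - mean_x) / sd_x, 2) for x, _ in TABLE_1]

# Check 1: adjacent entries differ by ten times the coefficient 1/s_X.
step = 10 / sd_x

# Check 2: gross scores lying -3 ... +3 standard deviations from the mean.
check_points = {k: mean_x + k * sd_x for k in range(-3, 4)}

print(mean_x, round(sd_x, 3))
print(z_scores)
print(round(step, 4))
```

The two-decimal values agree with column z' of Table 2, and each entry of `check_points` can be compared against the class interval in which it should fall.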

If a graphic method of setting up the transformation from X-scores 
to z-scores is preferred, the simplest method is to set up appropriate 
coordinates on a graph, including the range of X-scores on one axis and 
z-scores from about -3.0 to +3.0 on the other axis. Select one X-score 
approximately 2 or 3 standard deviations below the mean, and 
calculate the corresponding z-score. Do the same for a high X-score 
approximately 2 or 3 standard deviations above the mean. Plot 
these two points on the graph, and connect them with a straight line. 
Several check points may then be selected. For example, a z-score of 
zero should correspond exactly to $M_X$ on the X-scale, and scores that 
are one standard deviation above and below the mean on the X-scale 
should correspond to z-values of plus one and minus one, respectively. 

The standard score is primarily useful for theoretical purposes. For 
example, it simplifies algebraic derivations involving variances and 
covariances; and tables of the normal curve have z-scores as one of their 
entries. However, it has marked disadvantages as a method of reporting 
scores for the individuals of a group. The range from -3 to +3 is 
awkward since it necessitates the use of negative and positive numbers. 
Also, in order to have a sufficient number of different scores, it is 
necessary to use decimals. It is conventional, therefore, to use some more 
convenient linear transformation of standard scores for reporting 
purposes. These, termed linear derived scores, will be considered in the 
next section. 

6. Linear transformations—linear derived scores 

Since the standard score (z-score) with a mean of zero and a standard 
deviation of unity necessitates using negative and decimal scores, it is 
usual to report scores in terms of some arbitrary distribution that has a 
standard deviation considerably greater than unity and a mean that is 
four or five times the standard deviation. Such a set of units, called 
here linear derived scores, avoids both negative and fractional scores. 

Several different transformations of this type have been found useful 
in different circumstances. For example, the Board of Examinations at 
the University of Chicago has used a linear derived score with a mean 
of 20 and a standard deviation of 4. Most scores would thus lie between 
8 and 32; and, even if an occasional score of plus or minus five standard 
deviations were found, we should still have scores ranging only from 
0 to 40. Such scores would not be confused easily with percentile scores 
that were used in reporting some of the entrance tests, and a class 
interval of one-fourth standard deviation is convenient for computing 
variances and correlations so that decimal scores need not be used. The 
College Entrance Examination Board adopted a linear derived score 
system for reporting scores on its examinations to the colleges. These 
scores have a mean of 500 and a standard deviation of 100. They 
range from a lower limit of 200 to an upper limit of 800, and cannot 
possibly be confused with percentile ratings, grade ratings (with 100 as 
perfect and 60 or 65 as failure), mental age ratings (in the 10 to 20 
range), or I.Q. ratings (in the 100 to 150 range) that may appear on the 
applicant's secondary school record. Because such scores would be 
unwieldy to record or to use in IBM card operations, the College Entrance 
Examination Board also adopted another linear derived score system 
for use within the office for keeping certain records, computing 
correlations, making item analyses, etc. This system uses a mean of 13 and a 
standard deviation of 4. The particular advantage of this scale is that 
the scores can be recorded in two columns of an IBM card, and the 
squares of the scores can be recorded in three columns. A score as 
large as $4.5\sigma$ would be 31, and the square of 31 is 961. Using five 
columns of the card to record the score and the square of the score 
facilitates many operations that require computing sums and sums of squares. 

During the second World War, the United States Navy used a basic 
aptitude test battery and reported scores in terms of a linear derived 
scale with a mean of 50 and a standard deviation of 10. Such a scale 
could be reported in two columns of an IBM card. Moreover, as long 
as operations requiring sums of squares were not used to a great extent, 
maximum use was made of the IBM cards, and the scale had reasonably 
fine subdivisions. The United States Army used an aptitude test 
battery and reported the scores on a linear derived scale with a mean of 100 
and a standard deviation of 20. This made the scale somewhat 
comparable to the I.Q. scale so that not too much change in habits regarding 
meaning of the numbers was required to make reasonable judgments 
for the new test scores. These examples illustrate some of the types of 
linear derived scores in use, and indicate some of the reasons for 
selecting given arbitrary values for the mean and standard deviation of the 
derived score scale. 

In order to determine the formula for computing any of these linear 
derived scores, let us use $w_i$ to designate the linear derived score of 
the $i$th individual and write 

(5)    $w_i = s_w z_i + M_w$, 

where $M_w$ is the value that has been selected as convenient for the mean 
      of the linear derived scores, and 
      $s_w$ is the value that has been selected as convenient for the 
      standard deviation of these scores. 

Since the standard deviation of the z-scores is unity, multiplying each 
z-score by $s_w$ will give a set of scores with a standard deviation equal 
to $s_w$. Also, since the mean z-score is zero, adding $M_w$ to each score 
will give a set of scores with a mean equal to $M_w$. Thus the 
transformation of equation 5 insures that the new scores will have the 
desired mean and standard deviation. 

To express the w-scores directly in terms of the gross scores, substitute 
equation 4 in equation 5 and write 

(6)    $w_i = \left(\dfrac{s_w}{s_X}\right) X_i + M_w - \left(\dfrac{s_w}{s_X}\right) M_X$, 

where the terms have the same definitions as in equations 3 and 5. 
The computing procedure is similar to that for equation 4, except that 
no provision need be made for negative scores, since $M_w$ and $s_w$ are 
selected so that all scores are positive. The procedure is to enter 
$M_w - (s_w/s_X)M_X$ in the keyboard, put it into the machine, clear the 
keyboard, and enter $(s_w/s_X)$. Add this quantity once to obtain the 
w-score equivalent to an X-score of one, twice for the equivalent of an 
X-score of two, and so on until the highest X-score has been reached. 
Again a graphic check can be made by computing the w-score equivalent to 
a very low X-score, and to a very high X-score, plotting these two 
points on a graph, connecting them with a straight line, and then 
computing several intermediate check points. 
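
Equations 5 and 6 can be written as two small Python functions. This is a minimal sketch for illustration only: the function names and the `SCALES` dictionary are introduced here, and the scale constants are those quoted in the text for the Chicago, College Entrance Examination Board, Navy, and Army scales.

```python
def linear_derived_score(z, mean_w, sd_w):
    """Equation 5: w = s_w * z + M_w."""
    return sd_w * z + mean_w

def linear_derived_from_gross(x, mean_x, sd_x, mean_w, sd_w):
    """Equation 6: w = (s_w / s_X) * X + M_w - (s_w / s_X) * M_X."""
    slope = sd_w / sd_x
    return slope * x + (mean_w - slope * mean_x)

# Scales mentioned in the text, as (mean, standard deviation).
SCALES = {
    "Chicago Board of Examinations": (20, 4),
    "College Entrance Examination Board": (500, 100),
    "Navy": (50, 10),
    "Army": (100, 20),
}

for name, (mean_w, sd_w) in SCALES.items():
    # A person one and a half standard deviations above the mean.
    print(name, linear_derived_score(1.5, mean_w, sd_w))
```

The same standing of one and a half standard deviations above the mean is reported as 26, 650, 65, or 130 depending on the scale chosen, which is the sense in which the choice of $M_w$ and $s_w$ is arbitrary.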

Linear derived scores (including, of course, standard scores) have this 
very valuable property: the characteristics of the original distribution 
of gross scores are duplicated in the transformed scores. The indices of 
skewness and kurtosis for the distribution of gross scores are identical 
with the indices for the distribution of linear derived scores, and both 
sets of scores will have the same correlation with any other variable. 
Non-linear transformations of gross scores will in general have indices 
of skewness, kurtosis, and correlation that are different from those of 
the original gross scores. 
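
The invariance of skewness and kurtosis under a linear transformation can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the grouped data of Table 1 with class midpoints as scores, the helper name `standardized_moment` is introduced here, and the slope and intercept are arbitrary illustrative values (any positive slope would serve).

```python
# (class midpoint, frequency) pairs transcribed from Table 1.
TABLE_1 = [
    (124.5, 2), (134.5, 3), (144.5, 12), (154.5, 23), (164.5, 37),
    (174.5, 51), (184.5, 39), (194.5, 21), (204.5, 9), (214.5, 2),
    (224.5, 1),
]

def standardized_moment(values, weights, order):
    """Weighted moment of the given order about the mean, in SD units."""
    n = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / n
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / n
    central = sum(w * (v - mean) ** order for v, w in zip(values, weights)) / n
    return central / var ** (order / 2)

gross = [x for x, _ in TABLE_1]
freqs = [f for _, f in TABLE_1]

# An arbitrary linear transformation with positive slope (illustrative values).
derived = [5.8 * x - 500.0 for x in gross]

skew_gross = standardized_moment(gross, freqs, 3)
skew_derived = standardized_moment(derived, freqs, 3)
kurt_gross = standardized_moment(gross, freqs, 4)
kurt_derived = standardized_moment(derived, freqs, 4)

print(skew_gross, skew_derived)
print(kurt_gross, kurt_derived)
```

The two skewness values agree, as do the two kurtosis values, up to floating-point rounding; a non-linear transformation of the same scores would not in general preserve them.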

The data of Table 1 are used to illustrate the computation of linear 
derived scores with a mean of 500 and a standard deviation of 100. 
Substituting these values for $M_w$ and $s_w$, and the mean and standard 
deviation of Table 1 for $M_X$ and $s_X$ in equation 6, gives the equation 

$w_i = \dfrac{100}{17.234} X_i + 500 - \dfrac{100}{17.234}(173.5)$, 

which may be written as the computing equation 

$w_i = 5.8025X_i - 506.7309$. 



The rectangular layout of Table 3 furnishes a convenient method of 
recording a linear derived score equivalent for each gross score of 
120 to 229. The computing procedure is to enter the additive term 
(-506.7309) in the keyboard and into the machine, then to clear the 
keyboard and to enter the coefficient (+5.8025). This coefficient is 
then multiplied by 120 to give the first entry (190). One additional 
rotation of the machine is needed to give each of the remaining 109 
entries in Table 3. The results are recorded to only three digits, which 
corresponds to units of one-hundredth of a standard deviation. The 
best method for checking a table like Table 3 is first to compute 
successive differences. These differences should each be equal to the 
multiplying coefficient, which in the present illustration is about 5.8, 
so that to the nearest unit the difference is usually 6, with an 
occasional 5. Second, to check for gross errors, it is desirable to 
determine the gross score points corresponding to -3, -2, -1, 0, +1, +2, 
and +3 standard deviations from the mean and to see that their derived 
score equivalents are, respectively, 200, 300, 400, 500, 600, 700, and 800. 

TABLE 3 

Computing Form for Linear Derived Scores 

$w_i = \dfrac{100}{17.234} X_i + 500 - \dfrac{100}{17.234}(173.5)$   or   $w_i = 5.8025X_i - 506.7309$ 

          0     1     2     3     4     5     6     7     8     9 
  12-   190   195   201   207   213   219   224   230   236   242 
  13-   248   253   259   265   271   277   282   288   294   300 
  14-   306   311   317   323   329   335   340   346   352   358 
  15-   364   369   375   381   387   393   398   404   410   416 
  16-   422   427   433   439   445   451   456   462   468   474 
  17-   480   485   491   497   503   509   515   520   526   532 
  18-   538   544   549   555   561   567   573   578   584   590 
  19-   596   602   607   613   619   625   631   636   642   648 
  20-   654   660   665   671   677   683   689   694   700   706 
  21-   712   718   723   729   735   741   747   752   758   764 
  22-   770   776   781   787   793   799   805   810   816   822 
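
The rectangular worksheet and its difference check can be generated directly from the computing equation. The following Python sketch is an illustration only: the names `SLOPE`, `INTERCEPT`, `derived_score`, and `table_3` are introduced here, and the constants are the rounded values 5.8025 and -506.7309 quoted in the text.

```python
SLOPE = 5.8025         # s_w / s_X = 100 / 17.234, as rounded in the text
INTERCEPT = -506.7309  # M_w - (s_w / s_X) * M_X

def derived_score(x):
    """Linear derived score for gross score x, rounded to the nearest unit."""
    return round(SLOPE * x + INTERCEPT)

# Rows are the leading digits 12- to 22-; columns are the final digit 0 to 9.
table_3 = {tens: [derived_score(10 * tens + unit) for unit in range(10)]
           for tens in range(12, 23)}

for tens, row in table_3.items():
    print(f"{tens}-", *row)

# Successive differences should be 6, with an occasional 5.
flat = [w for row in table_3.values() for w in row]
differences = {b - a for a, b in zip(flat, flat[1:])}
print(differences)
```

Any value in `differences` other than 5 or 6 would point to a slip in the table, which is the purpose of the first check described above.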

Linear derived scores, like standard scores, may also be obtained 
graphically. The best procedure is to set up the gross score and the 
derived score scale in suitable units on graph paper. The gross score 
and corresponding linear derived score are then found for three points, 
such as -3, 0, and +3 standard deviations from the mean. These 
three points are plotted; they should lie in a straight line. This line is 
the transformation line from which the derived score equivalent for a 
given gross score, or the reverse, may be read. 

Let us contrast the properties of linear transformations of gross 
scores (such as standard and other derived scores) with the properties 
of non-linear transformations (percentile and normalized scores) to be 
considered next. 

1. The linear transformation involves no assumptions about the 
distribution of the population or the sample. It has third and 
fourth moments identical with those of the raw score distribution. 
This fact has several important consequences. 

2. It is possible to tell from the distribution of transformed scores 
whether the test was too easy, too difficult, or about the correct 
difficulty level for the group. 

3. Since the correlation between gross scores is identical with the 
correlation between linear transformations of gross scores, the 
equations dealing with the effect of test length and group 
heterogeneity on reliability and validity (see Chapters 6 to 13) hold for 
gross scores and for any linear transformation of gross scores. The 
equations developed in Chapters 6 to 13 do not necessarily hold 
for non-linear transformations of gross scores. 

4. Equating various forms of tests is simpler if some linear 
transformation is used, since such a transformation depends only on 
estimating two parameters, the mean and the variance. The theory for 
equating when transformations are non-linear is more difficult to 
develop, and probably will give results with greater sampling errors. 

7. Non-linear transformations—percentile ranks 

We shall consider only the two most commonly used non-linear 
transformations, namely, percentile scores and normalized scores. 

A given individual’s percentile score indicates the percentage of 
persons in the distribution who score less than that individual. Consider a 
distribution of ten cases, each of which makes a different gross score. 
Each person is considered to occupy one-tenth of the entire percentile 
range from 0 to 100, as illustrated: 

0    10    20    30    40    50    60    70    80    90   100 
   5    15    25    35    45    55    65    75    85    95 

The score assigned to each person is the midpoint of the range occupied 
by that person so that, for a distribution of ten persons, the percentile 
scores will be 5, 15, ..., 95, as indicated above. If several different 
persons make the same score, each person’s score is the midpoint of the 
range occupied by all of them. In terms of the foregoing illustration, 
assume that the lowest three persons made the same score and that the 
second and third from the top made the same score. In such a case we 
should have 

0    10    20    30    40    50    60    70    80    90   100 
        15          35    45    55    65       80       95 

three percentile scores of 15 and two of 80, as illustrated. For a 
distribution of 100 cases, each person having a different score, the 
percentile scores would begin with 0.5 and proceed by unit steps to 99.5. 
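
The midpoint rule, including its treatment of tied scores, can be sketched in Python as follows. The function name `percentile_scores` and the ten gross scores are hypothetical, chosen here only so that the lowest three persons tie and the second and third from the top tie, as in the illustration.

```python
def percentile_scores(scores):
    """Midpoint percentile for each score: tied persons share the midpoint
    of the range they jointly occupy."""
    n = len(scores)
    result = []
    for s in scores:
        below = sum(1 for t in scores if t < s)
        tied = sum(1 for t in scores if t == s)
        result.append(100 * (below + tied / 2) / n)
    return result

# Ten hypothetical gross scores: the lowest three tie, and the second and
# third from the top tie, as in the illustration above.
scores = [31, 31, 31, 40, 44, 47, 52, 58, 58, 63]
print(percentile_scores(scores))
```

For these scores the function gives three percentile scores of 15 and two of 80, with 35, 45, 55, 65, and 95 for the untied cases.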

Let us use the data of Table 1 to illustrate the general procedure for 
calculating a percentile equivalent for the midpoint of each class interval. 
Table 4 illustrates the computation of a percentile corresponding to the 
midpoint of each class interval and to the boundary between the class 
intervals. The midpercentile for a given interval is assigned to all the 
cases in that interval. The procedure is to compute 100/2N, enter this 
figure in the calculating machine keyboard, and multiply by the number 
of cases in the lowest class interval to obtain the midpercentile for that 
interval. Multiplying a second time by the number of cases in the 
lowest class interval gives the percentile corresponding to the upper 
bound of the lowest, and the lower bound of the next class interval. 
We then multiply twice by the frequency in the next class interval, and 
so on until the percentile score 100 is reached. 

In Table 4 this procedure is illustrated for a distribution of 200 cases. 
The quantity 100/2N is 0.25. This quantity is entered in the machine 
and multiplied by 2, giving 0.50, then by 2 again, giving 1.00. Thus the 
percentile score assigned to the lowest two persons is 0.5. Next multiply 
by 3, the frequency in the next class interval, and enter 1.75, the 
percentile score for the three persons scoring in the 130’s; then by 3 again, 
obtaining 2.50 for the boundary between the second and third class 
intervals. This procedure is continued until the final check percentile 
is obtained. The percentile equivalent of the upper bound of the highest 
class interval must be 100.000 ... to as many decimal places as are being 
recorded. In Table 4 the percentile corresponding to the upper and 
lower bound of each class interval has been recorded (the upper bound of 
one class interval being identical with the lower bound of the next higher 
class interval). Since only the midpercentile is used, it is better 
procedure to record only the midpercentile and omit the upper and lower 
bounds. They were included to make the computational procedure 
clear. The last three columns of Table 4 are given as a check on the 
number of revolutions that should be recorded in the calculating machine 
at each step. The speediest method of calculating percentiles is 
to follow the procedure indicated in Table 4, recording only the 
midpercentiles and the final check percentile of 100.00. 

TABLE 4 

Computing Form for Percentile Scores 

                         Cumu-               p               Corresponding multipliers 
     X        Fre-       lative 
              quency     frequency   Lower    Mid     Upper     Lower    Mid    Upper 
  120-129        2           2        0.00    0.50     1.00        0       2       4 
  130-139        3           5        1.00    1.75     2.50        4       7      10 
  140-149       12          17        2.50    5.50     8.50       10      22      34 
  150-159       23          40        8.50   14.25    20.00       34      57      80 
  160-169       37          77       20.00   29.25    38.50       80     117     154 
  170-179       51         128       38.50   51.25    64.00      154     205     256 
  180-189       39         167       64.00   73.75    83.50      256     295     334 
  190-199       21         188       83.50   88.75    94.00      334     355     376 
  200-209        9         197       94.00   96.25    98.50      376     385     394 
  210-219        2         199       98.50   99.00    99.50      394     396     398 
  220-229        1         200       99.50   99.75   100.00      398     399     400 

  N = 200          100/2N = 100/400 = 0.25 
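
The cumulative routine of Table 4 can be sketched in Python. This is an illustration rather than a transcription of the machine procedure: the names `TABLE_1`, `percentile_bounds`, `unit`, and `multiplier` are introduced here, with `multiplier` playing the part of the running count of machine revolutions.

```python
# (class interval, frequency) pairs transcribed from Table 1.
TABLE_1 = [
    ("120-129", 2), ("130-139", 3), ("140-149", 12), ("150-159", 23),
    ("160-169", 37), ("170-179", 51), ("180-189", 39), ("190-199", 21),
    ("200-209", 9), ("210-219", 2), ("220-229", 1),
]

def percentile_bounds(table):
    """Lower, mid, and upper percentile for each class interval."""
    n = sum(f for _, f in table)
    unit = 100 / (2 * n)        # the quantity 100/2N entered in the machine
    multiplier = 0              # running count of machine revolutions
    rows = []
    for label, f in table:
        lower = unit * multiplier
        mid = unit * (multiplier + f)
        upper = unit * (multiplier + 2 * f)
        rows.append((label, lower, mid, upper))
        multiplier += 2 * f     # advance past this class interval
    return rows

rows = percentile_bounds(TABLE_1)
for label, lower, mid, upper in rows:
    print(f"{label}  {lower:6.2f}  {mid:6.2f}  {upper:6.2f}")
```

The upper percentile of the last interval serves as the final check: it should come out at exactly 100.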

A routine for computing percentiles that gives a check on the number 
of revolutions in the machine at each step and records only 
midpercentiles is shown in Table 5. The columns X and f give scores and 
frequencies as before. A zero frequency is added for a hypothetical class 
interval below the lowest and above the highest. Column f' gives the sums 
of adjacent entries in column f. The column labeled Σf' is a cumulative 
frequency of the f' column. The entries in column Σf' are identical 
with those in the next to the last column in Table 4, except that the 
check multiplier of 400 (2N) has been added. The quantity 100/2N 
(0.25) is multiplied in turn by each of the entries in column Σf', giving 
the percentiles in the column labeled p. These are identical with the 
midpercentiles of Table 4, except that the final check percentile appears 
at the bottom. 

TABLE 5 

Computing Form for Percentile Scores 

     X           f       f'      Σf'        p 
  (below)        0                 0       0.00 
  120-129        2        2        2       0.50 
  130-139        3        5        7       1.75 
  140-149       12       15       22       5.50 
  150-159       23       35       57      14.25 
  160-169       37       60      117      29.25 
  170-179       51       88      205      51.25 
  180-189       39       90      295      73.75 
  190-199       21       60      355      88.75 
  200-209        9       30      385      96.25 
  210-219        2       11      396      99.00 
  220-229        1        3      399      99.75 
  (above)        0        1      400     100.00 

  N = 200          100/2N = 100/400 = 0.25 

Each entry in f' is the sum of the frequency in its own row and the 
frequency in the row above. Enter 100/2N in the machine. Multiply 
cumulatively by the entries in f'. The dial indicating number of 
revolutions will show successively the entries in Σf'. 
Check: The last product should be 100.00 to as many decimals as are being 
recorded. 

Regardless of the original shape of the distribution of gross scores, the 
distribution of percentile scores will be rectangular. Percentile scores 
furnish a convenient method of indicating a person’s standing relative 
to a specified group. Such scores are easy to explain to other persons, 
and are felt to be readily understood. Here, however, the advantages 
of percentile scores end. Such scores cannot legitimately be subjected 
to the usual arithmetical operations. For example, if two tests are 
involved, and Mr. A has a percentile rating of 60 in one and 70 in the 
other, whereas Mr. B has ratings of 50 and 80, the procedure of 
averaging the percentiles would give 65 for both persons. Mr. B, however, 
probably would have a higher average if the original gross scores were 
used. Just as average percentiles are misleading, so the correlation 
coefficients found from using percentiles are different from (usually 
smaller than) those found with gross scores. The amount of drop in 
correlation brought about by changing from gross scores to percentile 
scores in a normal distribution has been discussed by Karl Pearson (1907). 
Pearson indicates that at most the correlation between normalized scores is 
.0180 greater than the correlation between percentile scores. 
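
The example of Mr. A and Mr. B can be made concrete with a short Python sketch. It rests on an assumption that is not in the text: that the gross scores on both tests are normally distributed with equal standard deviations, so that each percentile can be converted to a normal deviate. The helper name `normal_deviate` is introduced here; `statistics.NormalDist` is part of the Python standard library (version 3.8 and later).

```python
from statistics import NormalDist

def normal_deviate(percentile):
    """Base line value of the unit normal curve below which the given
    percentage of cases falls."""
    return NormalDist().inv_cdf(percentile / 100)

# Percentile ratings on the two tests, as given in the text.
ratings = {"Mr. A": (60, 70), "Mr. B": (50, 80)}

for person, (first, second) in ratings.items():
    mean_percentile = (first + second) / 2
    mean_deviate = (normal_deviate(first) + normal_deviate(second)) / 2
    print(person, mean_percentile, round(mean_deviate, 2))
```

Both men average 65 in percentile terms, yet under the normality assumption Mr. B's average deviate is the larger of the two, in line with the remark that he would probably have the higher average on the original gross scores.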

From the illustrations given, we see that the maximum possible and 
the minimum possible percentile scores are functions of the size of the 
group taking the test. For a distribution of ten cases, these limits are 
95 and 5. For a distribution of a hundred cases, these limits are 99.5 
and 0.5. For distributions of a hundred cases or over, the effect of N 
on percentile scores can usually be ignored. However, normalized 
scores of the very high-scoring and the very low-scoring persons are 
markedly affected by the number of cases in the distribution and by 
slight differences in the extremes of the distribution. These effects will 
be illustrated in the discussion of normalized scores in section 8. 

The most striking defect of percentile scores appears, however, when 
we consider the problems of making norms comparable from group to 
group or test to test. Each percentile score is sensitive to any local 
change in its part of the distribution. Unlike the standard scores, the 
percentile score does not depend upon certain constants characteristic 
of the distribution as a whole. Standard scores, as indicated in 
equation 2, depend upon only two parameters, a mean and a standard 
deviation. 

In the equating of percentile scores, no such simple parameters exist. 
Thus we see why it is that, with the growth of testing techniques, the 
percentile score has gradually been abandoned as a basic type of score, 
despite its seeming ease of interpretation. It is frequently convenient, 
however, to supplement linear derived scores with a table of percentiles 
expressed with reference to some specified group to aid in the initial 
interpretation of these scores. 

8. Non-linear transformations—normalized scores 

Since the normal distribution has many convenient properties, and 
since many distributions have been found to be normal or Gaussian 
distributions, another type of score is used in which the frequency 
distribution has been distorted from its original shape into a normal 
distribution. 

After percentile scores are obtained, the normalized scores are 
obtained from tables of the normal curve. The base line (usually listed 
in the tables as x or z) value corresponding to each percentile is found. 




Such a set of scores would range from —3 to +3, which is sometimes 
regarded as an undesirable score range. So again, as in the change 
from standard scores to the more general linear derived scores, we may 
multiply the normalized scores by any suitable value to give a standard 
deviation greater than unity, and we may add any suitable value to 
avoid negative scores. 

Like percentile scores, the normalized scores do not duplicate the 
properties of the original gross score distribution. Regardless of the 
skewness and kurtosis of the original distribution, the skewness of the 
normalized scores will be zero and the kurtosis three. However, the 
usual arithmetic operations with scores, such as averaging and 
calculating correlations, are probably legitimate operations to perform with 
normalized scores, as they are not with percentile scores. The problem 
of comparability from test to test and group to group is more difficult 
with normalized than with standard or linear derived scores. Thurstone 
(1925 and 1927b), however, has presented a method for dealing with 
this problem. Flanagan (1939b) has described the use of a system of 
normalized scores by the Cooperative Test Service. 

As in the case of percentile scores, the range of normalized scores 
varies with the number of cases in the distribution. With normalized 
scores this difference is very marked at the extremes of the distribution. 
For a distribution of 10 cases, the percentile score limits are 95 and 5. 
The corresponding normalized score limits are ^1.64. For a distribu¬ 
tion of 100 cases, the percentile score limits of 99.5 and 0.5 correspond 
to normalized score limits of ^2.58. Also slight differences in grouping 
in the extremes of a distribution, such as might be brought about by 
vaiying degrees of skewness or kurtosis, will have a very pronounced 
effect upon the extreme normalized scores. For example, in a distribu¬ 
tion of 200 cases, if one person makes the highest raw score his per¬ 
centile score is 99.75, and his normalized score is 2.81, as shown in 
Table 6. If five persons of the 200 tie for top score, the percentile score 
for this group is 98.75, which is only one point lower than the score 
obtained by the top one person. However, the normalized score equiv¬ 
alent is 2.24, or more than half a standard deviation less than 2.81—the 
normalized score for the top ranking one . Such apparently slight 
differences in groupings can make very serious differences in reported 
test results. If normalized scores on different tests are to be compared, 
it is important to be sure that slight differences in groupings in extreme 
cases do not occur, and also to be certain that the groups are similar in 
size; otherwise the results reported for normalized scores will be influ¬ 
enced more by the size of the group and by slight differences in grouping 
in the extremes than by the abilities of the students. 

The computation of normalized scores is illustrated in Table 6. First, 
we compute the percentile score equivalents as illustrated in Table 5. 
Then from a table of the normal curve we read the base line values 
(normalized scores) that correspond to the various areas under the 
curve (that is, the percentile scores). 

TABLE 6 

A Worksheet for Recording Normalized Scores 

    X            f        p         n
    120-129      2      0.50     -2.58
    130-139      3      1.75     -2.11
    140-149     12      5.50     -1.60
    150-159     23     14.25     -1.07
    160-169     37     29.25     -0.55
    170-179     51     51.25     +0.03
    180-189     39     73.75     +0.64
    190-199     21     88.75     +1.21
    200-209      9     96.25     +1.78
    210-219      2     99.00     +2.33
    220-229      1     99.75     +2.81

Columns X, f, and p are taken from Table 5. Column n gives the normalized score; 
n is read from a table of the normal curve by entering it with the values p or 1 - p. 
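
The worksheet can also be reproduced numerically. The following is a minimal sketch in Python (3.8 or later), not part of the original procedure: it assumes that the percentile score of an interval is the percentage of cases below the interval plus half the cases within it (a convention consistent with the p column above), and it replaces the printed table of the normal curve with the standard library's inverse normal function. The function names are illustrative only.

```python
from statistics import NormalDist

# Class intervals and frequencies as listed in Table 6.
intervals = ["120-129", "130-139", "140-149", "150-159", "160-169", "170-179",
             "180-189", "190-199", "200-209", "210-219", "220-229"]
freqs = [2, 3, 12, 23, 37, 51, 39, 21, 9, 2, 1]


def percentile_scores(freqs):
    """Percentile score of each interval: percentage of cases below the
    interval plus half the cases in it (assumed mid-interval convention)."""
    total = sum(freqs)
    below = 0
    result = []
    for f in freqs:
        result.append(100.0 * (below + f / 2.0) / total)
        below += f
    return result


def normalized_scores(percentiles):
    """Base-line value of the unit normal curve for each percentile score."""
    unit_normal = NormalDist()
    return [unit_normal.inv_cdf(p / 100.0) for p in percentiles]


p = percentile_scores(freqs)
n = normalized_scores(p)
for label, f, p_i, n_i in zip(intervals, freqs, p, n):
    print(f"{label:>8} {f:>3} {p_i:6.2f} {n_i:+.2f}")
```

Each printed row should agree with the corresponding row of Table 6 to two decimal places.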

The use of normalized scores is indicated if there is reason to believe 
that the ability measured by the test is normally distributed and that 
defects in the test make the distribution of gross scores non-normal. 
Normalized scores on different tests are not comparable unless the 
groups are of similar size, and the distribution of extreme scores is simi¬ 
lar in both distributions. 
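
The dependence of the extreme normalized scores on group size and on ties may be checked numerically. The sketch below is illustrative only; it assumes the same mid-interval convention for percentile scores as Table 6, and the function names are not from the text.

```python
from statistics import NormalDist


def top_percentile(n_cases, n_tied=1):
    """Percentile score shared by the n_tied highest cases out of n_cases
    (cases below plus half the tied cases, as a percentage)."""
    below = n_cases - n_tied
    return 100.0 * (below + n_tied / 2.0) / n_cases


def normalized_score(percentile):
    """Unit-normal deviate corresponding to a percentile score (0-100)."""
    return NormalDist().inv_cdf(percentile / 100.0)


# Upper limit of the normalized scores for groups of different size.
for n_cases in (10, 100, 200):
    pct = top_percentile(n_cases)
    print(n_cases, pct, round(normalized_score(pct), 2))

# Five persons tied for the top score in a group of 200.
pct_tied = top_percentile(200, n_tied=5)
print(pct_tied, round(normalized_score(pct_tied), 2))
```

The limits come out as 1.64, 2.58, and 2.81 for groups of 10, 100, and 200 cases, and the five-way tie lowers the top normalized score from 2.81 to 2.24, as stated above.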


9. Standardizing to indicate relation to a selected standard 
group: McCall’s T-score; Cooperative Test Scaled Scores 

In order to give a common reference point for various scores, it has 
been suggested that some standard group be chosen and carefully de¬ 
fined, and that then the scores of all individuals be referred to that 
group regardless of whether or not the individual is a member of that 
group. 

For example, McCall (1922) suggested that a normalized scale for 
general use in standardized tests be based on scores of 12-year-old chil¬ 
dren. He suggested that the mean normalized score for 12-year-olds be 
called 50, and the standard deviation of the normalized scores for 12- 
year-olds be fixed at 10. He suggested that such scores might be called 
T-scores (in honor of Thorndike and Terman), and that all individuals 
might be scored on this scale regardless of whether or not they were 
12 years old. McCall suggests the use of normalized scores, but he does 
not explain how we are to find out what gross score corresponds to very 
extreme normalized scores, such, for example, as plus or minus five or 
six standard deviations away from the mean. This difficulty in extra¬ 
polation has been overcome in subsequent expositions of the T-score 
by making it a standardized score or a linear derived score with mean 
50 and standard deviation 10 based on a group of 12-year-old children. 
Although this change in McCall’s original idea (see Hull, 1928, pages 
166-171, for example) makes it possible to extend the scale somewhat 
farther than the range of the original group, it still is rather a meaning¬ 
less standardization to include in a test for 12-year-olds items suitable 
for first-grade children that will be answered correctly by all the 12-year- 
old group, and items suitable for college students that will be answered 
correctly by essentially none of the 12-year-old group. The only usable 
type of solution for such a problem seems to lie in the devising of methods 
for putting several different groups on the same scale. Thurstone’s 
absolute scaling methods and the Cooperative Test Service system of 
Scaled Scores illustrate such methods. 
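
The linear form of the T-score mentioned above (a linear derived score with mean 50 and standard deviation 10 in the reference group) can be sketched as follows. The reference scores are hypothetical, and the use of the population form of the standard deviation is an assumption of this sketch rather than something specified in the text.

```python
from statistics import mean, pstdev


def linear_t_scores(scores, reference_scores):
    """Linear derived scores with mean 50 and standard deviation 10,
    referred to the reference group (which need not contain the scores)."""
    m = mean(reference_scores)
    s = pstdev(reference_scores)  # population SD: an assumption of this sketch
    return [50.0 + 10.0 * (x - m) / s for x in scores]


# Hypothetical gross scores of the reference (for example, 12-year-old) group.
reference = [12, 15, 18, 21, 24]

# Individuals scored on the same scale, whether or not they belong to the group.
t = linear_t_scores([18, 24, 30], reference)
print([round(value, 1) for value in t])
```

A gross score of 30 lies outside the range of the hypothetical reference group, yet it still receives a value on the scale; this is the extension beyond the original group that the linear form permits.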

Thorndike suggested that successive groups be normalized on the 
same scale. He suggested making allowance only for differences in the 
means of the various groups. If the normalized score of one group is 
designated by X and the normalized score of another group by Y, 
Thorndike’s method amounts to equating the groups by using only the 
assumption that X = Y + C. The score when related to the X-group 
will differ by a constant from the score related to the Y-group. He 
assumed that the means of the groups differed but that the different 
groups each had the same standard deviation. In using this method to 
standardize items, it was found that scale values of items varied syste¬ 
matically from one group to another. Thurstone suggested that more 
freedom be allowed in equating the groups. His suggestion was that 
all the groups be assumed to be normally distributed on the same base 
line, but that it be assumed that the means and standard deviations of 
the different distributions might be different. Thurstone’s method of 
absolute scaling based on this assumption has been found to give con¬ 
sistent results in several instances in which it has been used. Gardner 
(1947), working with Rulon and Kelley at Harvard, has suggested that 
another degree of freedom be allowed in trying to match several different 
distributions to the same base line. He has assumed that the groups 
may differ in mean, in standard deviation, and in skewness. That is, 
the distributions need not have the zero skew characteristic of the normal 
distribution. Gardner has used this method in analyzing score 
distributions for tests given at various grade levels, and has found 
definite skewness differences from grade to grade. 

The Scaled Scores of the Cooperative Test Service are similar to the 
absolute scaling units Thurstone has suggested, in that the different 
groups used are assumed to be normally distributed with different 
means and standard deviations, on the same basic scale. The Coopera¬ 
tive Test Service Scaled Scores are based on the performance of a group 
of average white children in the United States at the completion of a 
particular course in a typical school with the usual instruction in that 
subject. Such a group is assumed to be normally distributed with a 
mean of 50 and a standard deviation of 10. It is clear that, in selecting 
cases for such a standardization, there must be a number of somewhat 
arbitrary decisions and assumptions. Thurstone made no suggestions 
regarding any arbitrary value for a mean and a standard deviation. He 
pointed out that the standard deviation of some selected group could be 
termed unity or ten, and that a zero point could be chosen three to five 
standard deviations below the mean of the lowest group. 

A system for normalizing several distributions on the same base line 
that is rigorous and complete with significance tests and confidence 
intervals has not yet been devised. The procedure described by Flanagan 
(1939b) is an iterative one and uses only the points corresponding 
to the median score of each of the distributions considered. Since 
Thurstone’s procedure is simple and direct, requiring no successive 
approximations, we shall describe it here. Flanagan (1939b) has described 
the Cooperative test procedure and has worked out an illustrative 
example with both his own and Thurstone’s method. 

In his bulletin on Scaled Scores, Flanagan (1939b) indicates that it 
was Kelley who suggested 50 as the mean for the average child, subject 
to the average training. The concept was developed in connection with 
Kelley's unpublished Universal Grading System. 

10. Thurstone’s absolute scaling methods for gross scores 

Thurstone's absolute scaling procedure as applied to test scores 
involves the following steps. 

1. Give the test to two or more groups, so that there will be a marked 
overlap in the distributions of adjacent groups. We shall illustrate 
with two such groups, a and b. 

2. Select ten or twenty gross score points ($X_i$), so that percentile 
scores (and hence normalized scores) can be determined for both 
groups a and b. 

3. Determine the normalized scores ($Y_{ia}$ and $Y_{ib}$) for groups a and b 
corresponding to each of the selected gross score points $X_i$. 

4. Plot $Y_{ia}$ on the ordinate against $Y_{ib}$ on the abscissa for these 
selected points. 

5. If the two groups can each be normalized on the same base line, 
this plot will be linear. 

In order to show this, let us assume a basic scale of values $V_i$, in 
terms of which both groups are measured. If $M_{Va}$ and $s_{Va}$ designate 
the mean and the standard deviation of the a-group in these standard 
units, any given score $V_i$ may be expressed in terms of this mean and 
standard deviation as 

$$ V_i = M_{Va} + Y_{ia} s_{Va}, \tag{7} $$

where $Y_{ia}$, the normalized score with respect to the a-distribution, 
indicates the number of standard deviation steps between the mean ($M_{Va}$) 
and the score ($V_i$). Such an equation, with a different value of $V_i$ and 
$Y_{ia}$, and the same value of $M_{Va}$ and $s_{Va}$, applies for each of the gross 
score points selected for the comparison. Similarly, each of these points 
may be referred to distribution b instead of distribution a, and another 
set of equations written. These equations are 

$$ V_i = M_{Vb} + Y_{ib} s_{Vb}. \tag{8} $$

Equating these for successive values of $V_i$, we have 

$$ M_{Va} + Y_{ia} s_{Va} = M_{Vb} + Y_{ib} s_{Vb}, \tag{9} $$

where $M_{Va}$ and $M_{Vb}$ are the means in hypothetical absolute units for 
distributions a and b, 

$s_{Va}$ and $s_{Vb}$ are the standard deviations in hypothetical absolute 
units for distributions a and b, and 

$Y_{ia}$ and $Y_{ib}$ are the normalized scores for distributions a and b, 
respectively. 

This fundamental equation as applied to test scores is given by 
Thurstone (1938). Since the M’s and s’s are constant regardless of the 
varying values of the Y’s, we have a linear relationship between $Y_{ia}$ 
and $Y_{ib}$ that may be written 

$$ Y_{ia} = \frac{s_{Vb}}{s_{Va}}\, Y_{ib} + \frac{M_{Vb} - M_{Va}}{s_{Va}}. $$

That is, if it is possible to normalize both the a and the b distributions 
on the same base line, by assuming only different means and standard 
deviations, the plot of the normalized scores ($Y_{ia}$) against the 
normalized scores ($Y_{ib}$) will be linear with a slope equal to the ratio of the 
standard deviations, and an intercept equal to the difference of means 
divided by one of the standard deviations. If one mean and one standard 
deviation are known, it is possible to solve for the other mean and 
standard deviation. Thus, by assuming a mean and standard deviation 
for the a group, the mean and standard deviation of the b group 
can be computed. Then this computed mean and standard deviation 
for the b group can be used in the b-c comparison to solve for the mean 
and the standard deviation of c. 
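
The computation for two groups may be sketched as follows. This is a minimal illustration and not Thurstone's own worksheet: the line is fitted by ordinary least squares (the text specifies only that the plot be linear), the percentages below each score point are hypothetical values generated from known distributions so that the recovered constants can be checked, and all function names are illustrative.

```python
from statistics import NormalDist


def fit_line(x, y):
    """Least-squares slope and intercept for y plotted against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def absolute_scaling(pct_below_a, pct_below_b, mean_a=0.0, sd_a=1.0):
    """Mean and SD of group b on the base line fixed by group a.

    pct_below_a and pct_below_b are the percentile scores of the same
    gross score points in groups a and b. The slope of Y_ia on Y_ib is
    s_Vb / s_Va and the intercept is (M_Vb - M_Va) / s_Va.
    """
    unit_normal = NormalDist()
    y_a = [unit_normal.inv_cdf(p / 100.0) for p in pct_below_a]
    y_b = [unit_normal.inv_cdf(p / 100.0) for p in pct_below_b]
    slope, intercept = fit_line(y_b, y_a)  # Y_ia on the ordinate
    return mean_a + intercept * sd_a, slope * sd_a


# Hypothetical check data: score points on a base line where group a has
# mean 0 and SD 1, and group b has mean 1 and SD 1.25.
points = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
pct_a = [100.0 * NormalDist(0.0, 1.0).cdf(v) for v in points]
pct_b = [100.0 * NormalDist(1.0, 1.25).cdf(v) for v in points]

mean_b, sd_b = absolute_scaling(pct_a, pct_b)
print(round(mean_b, 3), round(sd_b, 3))
```

With real data the points will not fall exactly on a line, and the scatter about the fitted line is what must be judged before the equating is accepted.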

It must be noted, however, that the entire process of equating the 
scores of the various distributions is dependent upon the linearity of 
the plot $Y_{ia}$ against $Y_{ib}$. If this plot deviates markedly from the 
linear, we must conclude that both these groups cannot be normal on 
the same base line. The equating procedure is indicated to be impos¬ 
sible on these assumptions, and cannot be carried out legitimately. At 
present we do not have significance tests to determine when the pro¬ 
cedure is legitimate, and when not. It is necessary to use judgment 
regarding the seriousness of deviation from linearity until significance 
tests are developed. 

Thurstone (1938) has applied this method and has shown that a 
national distribution of 40,229 A.C.E. scores, the distribution of 646 
University of Chicago freshmen, and of 113 test subjects volunteering 
for the primary mental abilities battery may all be regarded as normal 
on the same base line. 

The absolute scaling technique makes it possible to plot the frequency 
distributions of many different groups on the same base line. With 
units so established, it would be possible to indicate something about 
the nature of the mental growth curve for different types of mental 
functions. Thurstone has applied such scaling methods to Binet test 
items from different ages, and he finds a mental growth curve that is 
slightly negatively accelerated although it is still rising rapidly at age 
14; see Thurstone (1925). He also applied the same absolute scaling 
method to the completion test data collected by Trabue (Thurstone, 
1927b), and he found a growth curve with only a very slight negative 
acceleration that was still rising very rapidly at grade 12. 

11. Standardizing to indicate age or grade placement 

One of the methods currently much in use for scaling of test scores is 
to express the results in terms of the subject’s standing with respect to 
one of several possible standard groups. Mental Age units and Educa¬ 
tional Age units are examples of this type of scaling. In the case of 
Mental Age units, the individual is given a score that represents the 
“age group to which he belongs on account of his test score.” Similarly, 
the Educational Age units are used to indicate the grade group 
that the individual resembles. Thurstone (1926) has shown the unsatisfactory 
nature of the Mental Age unit as well as the ambiguity of 
definition of such a unit. 

An Educational Age of 8, for example, may be assigned to the average 
score made by all eighth-grade students; or it may be assigned to a 
score $X_i$, which is selected so that the eighth grade is the average grade 
of persons who make that score. These two definitions will not lead to 
the same set of norms. The first definition corresponds to the regres¬ 
sion of test score on grade placement, and the second corresponds to the 
regression of grade placement on test score. 

In order to interpret such age or grade norms, we must know which 
regression line has been used, and we must also know the amount of 
variation about that regression line. For example, suppose that grade 
norms are established on the basis of the regression of score on grade. 
Then a person who has a grade placement score of 8, for example, has 
made a score equal to the average of scores made by a representative 
group of eighth-grade students. Suppose we know that this student is 
in the sixth grade; it is then possible to say that he is two years advanced, 
in the sense that if he were put with a group of eighth graders he would 
score at the average of that group. However, we do not learn from such 
information alone how usual or unusual such a performance is. If such 
a student is at the 95th percentile on sixth-grade norms, we know that only 
5 per cent of sixth-grade students are two years or more advanced. 
However, if this point is at the 80th percentile on the sixth-grade norms, we 
know that there is a great deal of overlap between the successive grades, 
so much in fact that 20 per cent of students in the sixth grade are at or 
above the score made by the average eighth grader. 
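
The amount of overlap between grades can be expressed as the percentage of the lower grade scoring at or above the mean of the higher grade. The sketch below assumes, purely for illustration, that scores within a grade are normally distributed; the means and standard deviations are hypothetical and were chosen to reproduce roughly the 5 and 20 per cent cases just described.

```python
from statistics import NormalDist


def percent_at_or_above(score, group_mean, group_sd):
    """Percentage of a normally distributed group scoring at or above score."""
    return 100.0 * (1.0 - NormalDist(group_mean, group_sd).cdf(score))


sixth_grade_mean = 40.0   # hypothetical
eighth_grade_mean = 52.0  # hypothetical

# The same two-year difference in means, with small and large spread
# of scores within the sixth grade.
for sixth_grade_sd in (7.3, 14.3):
    pct = percent_at_or_above(eighth_grade_mean, sixth_grade_mean, sixth_grade_sd)
    print(sixth_grade_sd, round(pct, 1))
```

The same grade equivalent thus marks an unusual performance when the within-grade spread is small and a fairly common one when it is large.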

The same type of remark applies to any other set of norms based on 
successive groups, whether they are age groups, grade groups, height 
groups, or some other type of grouping. To know that a person is two 
or three years advanced or retarded in a given characteristic becomes 
much more meaningful if we also know the percentage of his group that 
is advanced or retarded an equal or greater amount. 

Similar considerations apply if the other regression line is used. Sup¬ 
pose, for example, that we are using the regression of chronological age 
on test score. Then the age equivalent would be the average age of 
persons making the same score. Suppose that the average age of per¬ 
sons making a given score is ten, and that the student whose score is 
being interpreted is eight years old. We know that he is with a group 
that is on the average two years older than he is. Again if only 5 per 
cent of the students making that score are under eight years of age, we 
are dealing with a relatively unusual degree of advancement. However, 
if 20 or 25 per cent of students making that score are under eight, 
we are dealing with a somewhat more common degree of advancement. 

Figure 1 illustrates the difference between these two modes of procedure. 
Line A is the regression of score on age, that is, the average score made 
by persons of each chronological age. According to this line a score of $X_1$ 
is equivalent to an age level of b years, whereas a score of $X_2$ is equivalent 
to an age level of c years. On the other hand, the line B is the 
regression of age on score, that is, the average age of those persons making a 
given score. According to this line, the age level corresponding to a 
score of $X_1$ is a years, and the age level corresponding to a score of $X_2$ 
is b years. It will be noticed that, for all scores above the mean, the 
regression of age on score will give lower age equivalents for any gross 
score level than will the regression of score on age. For scores below 
the mean, the regression of age on score gives higher age equivalents 
than does the regression of score on age. It will also be noticed that the 
“age difference” corresponding to any two scores is very large if the 
regression of score on age is used, and it is small if the regression of age 
on score is used. 

Figure 1. Illustrating the difference between the regression of score on age, and 
age on score. (Horizontal axis: score, with $X_1$ and $X_2$ marked; vertical axis: 
age, with levels a, b, and c marked.) 
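
The two definitions of an age equivalent can be compared on a small set of paired observations. The data below are hypothetical, and the helper names are illustrative; the sketch fits both regression lines by least squares and reads an age equivalent from each.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical ages and test scores (mean age 10, mean score 35).
ages = [8, 8, 9, 9, 10, 10, 11, 11, 12, 12]
scores = [20, 30, 25, 35, 30, 40, 35, 45, 40, 50]

slope_a, intercept_a = fit_line(ages, scores)  # line A: score on age
slope_b, intercept_b = fit_line(scores, ages)  # line B: age on score


def age_equivalent_line_a(score):
    """Age whose average score equals the given score (line A inverted)."""
    return (score - intercept_a) / slope_a


def age_equivalent_line_b(score):
    """Average age of persons making the given score (line B)."""
    return slope_b * score + intercept_b


for score in (40, 45):
    print(score,
          round(age_equivalent_line_a(score), 2),
          round(age_equivalent_line_b(score), 2))
```

For these scores above the mean, line B gives the lower age equivalent, and the age difference between the two scores is one year on line A but only about two-thirds of a year on line B.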



It is interesting to note that, if the regression of age on score is used, 
as tests become more unreliable children above the average will be reported 
as less advanced than they are, and children below the average 
will be reported as less retarded than they are. If the regression of 
score on age is used, children who are above average are reported as 
very remarkably advanced, and children below average are reported as 
very markedly retarded. 

This effect is demonstrated in Figure 2. Line A is the regression of 
test score on age, and line B the regression of age on test score for a very 
reliable test that correlates highly with age. Since lines A and B are 
close together, it makes little difference which regression line is used 
when the correlation between score and age is high. If the test is shortened 
and becomes unreliable, line A will tend to move into a position 
such as line a, and B will move toward line b. Line a then represents 
the regression of score on age for a relatively unreliable test, one that 
does not correlate very highly with age. Line b represents the regression 
of age on score for such a test. 

Figure 2. Showing the effect of test unreliability and regression line used, upon 
norms. 

By taking any illustrative score level above the mean, we see that 
such a child will appear more advanced if the regression of score on age 
is used, and less advanced if the regression of age on score is used. In a 
similar manner, for any score less than average, the unreliable test will 
minimize the degree of retardation if the regression of age on score is 
used, and exaggerate it if the regression of score on age is used. The 
only method of taking account of such effects is to report the error of 
measurement of the test and the variability about the regression line 
which is used. Once this is done, the difference between the test repre¬ 
sented by lines A and B and the unreliable test represented by the dotted 
lines a and b becomes apparent in the norms. 
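
The divergence of the two lines as the correlation falls can be put in numbers. In the sketch below the correlation between score and age is simply set to a high and a low value to stand for the reliable and the unreliable test; the means and standard deviations are hypothetical, and the function name is illustrative.

```python
def age_equivalents(score, r, mean_score=35.0, sd_score=8.0,
                    mean_age=10.0, sd_age=1.5):
    """Age equivalents of a score from the two regression lines.

    Returns (age read from the score-on-age line,
             age read from the age-on-score line).
    """
    z = (score - mean_score) / sd_score
    via_score_on_age = mean_age + sd_age * z / r  # score-on-age line, inverted
    via_age_on_score = mean_age + sd_age * z * r  # age-on-score line
    return via_score_on_age, via_age_on_score


# A score one standard deviation above the mean, for a test that
# correlates highly with age and for one that does not.
for r in (0.9, 0.5):
    a, b = age_equivalents(43.0, r)
    print(r, round(a, 2), round(b, 2))
```

As the correlation drops, the age equivalent from the score-on-age line moves farther from the mean age while that from the age-on-score line moves closer to it, although the child's score is unchanged.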

Whenever a test is reported in terms of many different standard 
groups, as in the case of age norms or grade norms, it is essential to 
know: 

1. Which regression line is used. 

2. The variability of the standardization group about that regression 
line. 

3. The error of measurement of the test. 

Unless we have this information it is impossible to estimate the degree 
to which two or three years retardation or advancement indicates a 
marked deviation from normal performance or a marked unreliability 
in the test. 

To illustrate the same sort of logic with conventional height-weight 
norms, we may point out that such norms are usually constructed to 
give the regression of weight on height. That is, to use the norms, first 
find your height, and then note the average weight for persons of your 
height. Such information is of value in that it tells you how many 
pounds it is necessary to gain or lose in order to be of average weight for 
your height. Since it is not as easy to alter height as weight, the norms 
showing the average height for persons of your weight would not give a 
useful item of information. However, the usual height-weight norms 
do not tell anything about how usual or unusual your particular weight 
is for your height. Some percentile tables would be useful in indicating 
to each person that he was within the weight range of 50 per cent of 
persons of his height, or was heavier or lighter than all but 5 per cent 
of persons of his height. Such added information would be of value in 
indicating whether the person was unusually over- or underweight. 

In reporting norms on older tests, various types of quotients became 
popular. Not only was the test scored, for example, in terms of Mental 
Age, with no reference to variability of Mental Age attaching to a 
given chronological age, but the child's Mental Age was divided by 
the chronological age to obtain a quotient, known as the Intelligence 
Quotient or the I.Q. Similarly, the grade placement indicated by the 
test score was called Educational Age; and the Educational Age was 
divided by chronological age to obtain an educational quotient or E.Q. 
It was also suggested that one of these quotients be divided by the other 
to determine an accomplishment quotient or A.Q. 

Since we need the error of estimate and the error of measurement in 
order to make any reasonable interpretation of norms such as Mental 
Age or Educational Age, it would seem clear that further routine divi¬ 
sion would only make the scores more and more difficult to interpret. 
As Thurstone (1926) has pointed out, the best procedure is to abandon 
the various quotient type units, as well as the Mental or Educational 
Age units, and to use normalized or standard score type units, referring a 
given case to several different sets of norms if necessary. We could then 
say that this eight-year-old child has a percentile score of 92 on eight- 
year norms, and one of 50 on the eleven-year norms. Such a system 
would reflect both the typicality or atypicality of the child, and the 
rate of advancement of the group in the trait or skill in question. 

It should also be noted that the relationship between different norms 
is changed by social customs. For example, the relationship between 
age and grade norms is affected by changes in the educational customs 
regarding promotion from grade to grade. In the early 1900’s promotion 
was based primarily on achievement. The pupil who did not learn as 
rapidly as the average was not promoted. Such an educational system 
would give rise to a marked difference between age and grade norms, 
and also lead to a smaller dispersion of scores within each grade, accom¬ 
panied by less overlap in the scores of adjacent grades. The present 
custom of promoting a pupil primarily on the basis of age will increase 
the resemblance between age and grade norms, increase the dispersion 
of scores within a given grade, and produce a marked overlap in the 
scores of adjacent grades. Norms that were determined under the 
former system of promotion cannot be compared with norms estab¬ 
lished under the present system of promotion primarily on the basis of 
age. Similarly, norms that have been established under limited educa¬ 
tional opportunities, and when the illiteracy rate is high, cannot be 
expected to resemble norms established when the educational level of 
the population is increased, and the illiteracy rate is low. 

12. Standardizing to predict criterion performance 

If we are dealing with a situation where predicting a criterion per¬ 
formance is desired, the proper regression line is readily indicated. For 
this purpose the regression of criterion score on test score is the correct 
one to use and will give the best predictions in the sense that this line 
gives the average criterion score obtained by the persons making each 
given test score. The regression of test on criterion score will syste¬ 
matically give overpredictions of the performance of those scoring 
above the mean, and underpredictions of the performance of those 
scoring below the mean. 

In using either regression line, we must note the possible effects of a 
change in the population. For example, if we have established a regression 
line and a cutting score at level $X_i$ on Figure 3, and we find that 
the high-level applicants are attracted to other types of jobs, the new 
group of applicants will be in general considerably lower than the 
standardization group. If this selection is made in terms of variables 
that are correlated with criterion score more than with test score, the 
effect will be to lower the regression line as indicated in Figure 3 so 
that, in order to have as good a quality of selected persons, it would be 
necessary to raise the cutting score to some such point as $X_i'$. Conversely, 
if a depression throws a large number of highly qualified persons 
on the market, we should be dealing with a higher regression line; and, 
if the old cutting score ($X_i$) were maintained, it would be expected that 
the quality of work obtained from the selected applicants would be 
increased. 

Figure 3. Showing the effect on regression line of selection of the group on some 
variable related to criterion performance. 

In summary, then, if we are using test scores to predict a criterion 
score, and wish to set a cutting score such that the average criterion 
score of those at the cutting score will be some fixed value, then if there 
is a shortage of high-scoring and a surplus of low-scoring persons, it is 
necessary to raise the cutting score; whereas, if there is a surplus of 
high-scoring persons, or a shortage of low-scoring persons, it is necessary 
to lower the cutting score in order to have the average criterion performance 
of those at the cutting score remain at a specified level. That 
is, the adjustment required to maintain a given level of performance at 
the cutting score is the opposite of what we should wish. 

In part the decision for the use of the regression of x on y or of y on x 
may depend upon which variable is made the basis of selection of cases 
for the standardization group. For example, if there is reason to be¬ 
lieve that we have a representative sample of eight-year-old children 
or nine-year-old, ten-year-old, etc., we might use the regression of score 
on age and expect the regression of score on age found for that group to 
be duplicated in future samples. The regression of age on score (average 
age of those making a given score) is indicated if we feel that the 
sample drawn is representative of all ages making a given score. That 
is, if there have been no influences at work that would select with respect 
to age of the population, the regression of age on score is indicated. 

If we wish to use the regression of criterion on test score, the group 
may be selected explicitly on the basis of test score without biassing the 
regression line, but within the test score range selected there must be no 
selection on the basis of criterion score. For example, if workers who 
do not show a certain minimum production record are dismissed, and 
hence not included in the standardization group, the regression of 
criterion on test score will not be correct. We may select on the basis 
of the independent variable without biassing the regression of the de¬ 
pendent on the independent variable. There must be no selection on 
the basis of scores on the dependent variable or the regression line will 
not be correct. 
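
The effect of the two kinds of selection on the regression of criterion on test score can be seen in a small simulation. The sketch below is illustrative only: the scores are artificial normal deviates with an assumed correlation of 0.6, selection is made by keeping the upper half on one variable, and the seed and sample size are arbitrary.

```python
import random


def slope_y_on_x(pairs):
    """Least-squares slope of the regression of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so that the run is repeatable
r = 0.6                 # assumed correlation of test with criterion

pairs = []
for _ in range(50000):
    x = rng.gauss(0.0, 1.0)                               # test score
    y = r * x + (1.0 - r * r) ** 0.5 * rng.gauss(0.0, 1.0)  # criterion score
    pairs.append((x, y))

full = slope_y_on_x(pairs)
selected_on_test = slope_y_on_x([p for p in pairs if p[0] > 0.0])
selected_on_criterion = slope_y_on_x([p for p in pairs if p[1] > 0.0])

print(round(full, 2), round(selected_on_test, 2), round(selected_on_criterion, 2))
```

Selection on the test score leaves the slope essentially unchanged, whereas dropping the cases with low criterion scores flattens the line markedly.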

13. Marginal performance as a guide in determining cutting 
score 

In determining cutting score or in deciding on possible changes in a 
cutting score, it is sometimes helpful to consider the performance of the 
“marginal” student—the student immediately above or below the 
proposed cutting score. Figure 4 illustrates a correlation scatter plot 
of criterion versus test score, with the regression of criterion on test 
score ($Y = aX + \bar{Y} - a\bar{X}$), and a critical level ($L$) for the criterion 
score. Let us consider any given test score array ($X_i$) as illustrated in 
Figure 4. The mean criterion score of this array is on the regression 
line, and it may be written as 

$$ aX_i + \bar{Y} - a\bar{X}. $$

The standard deviation of this array (indeed, of any array) is the standard 
error of estimate, 

$$ s_y\sqrt{1 - r_{xy}^2}. $$

For any given x-array, the critical criterion level $L$ may be written as 
the deviation score, 

$$ L - aX_i - \bar{Y} + a\bar{X}. $$

Or, written as a standard score, we have 

$$ z_{Li} = \frac{L - aX_i - \bar{Y} + a\bar{X}}{s_y\sqrt{1 - r_{xy}^2}}, $$

where $z_{Li}$ is the deviation of the critical level ($L$) from the mean of 
array ($i$), using the standard deviation of the array or the error of 
estimate as the standard unit. 

Figure 4. Critical criterion level and regression of criterion on test score. 
(Horizontal axis: test score, X, marked from 0 to 13.) 

This quantity $z_{Li}$ may be computed for each of the possible test score 
values ($X_i$). These values may be converted into percentages by the 
use of a table of the normal curve. This series of percentages will show 
the percentage of persons that will be above (or below) the critical 
criterion level for each test score ($X_i$). Figure 5 is such a graph, showing 
$p$, the percentage above the critical criterion score for each value of 
$X_i$. A cutting score just below F would mean that the lowest persons 
accepted would have a 50-50 chance of being above the critical criterion 
level (see F in Figure 4 or 5). As the cutting score is moved below this 
point, the lowest persons accepted have a better than even chance of 
being below the critical criterion level. If the cutting score is fixed at 
a level considerably above F, persons with a better than even chance of 
being above the critical criterion level are being rejected. The decision 
to move the cutting score away from the point F depends on judging 
either that the need for additional persons is sufficiently urgent to justify 
accepting those who are more likely to fail than to qualify, or that we 
can afford to reject a group that has a better than even chance of suc¬ 
cess in order to reduce the total number of failures. 



Figure 5. Percentage above critical criterion level as a function of test score. 
(Horizontal axis: test score, X.) 
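
The series of percentages plotted in Figure 5 may be computed as in the following sketch. It assumes, as the use of the normal curve table implies, that criterion scores within each array are normally distributed with the standard error of estimate as their standard deviation. The constants are hypothetical and were chosen so that the 50-50 point falls at a test score of 5; the function name is illustrative.

```python
from statistics import NormalDist


def percent_above_critical(x, critical_level, a, mean_x, mean_y, sd_y, r_xy):
    """Percentage of the array at test score x expected to exceed the
    critical criterion level, assuming normal arrays of equal spread."""
    array_mean = a * x + mean_y - a * mean_x
    std_error_of_estimate = sd_y * (1.0 - r_xy ** 2) ** 0.5
    z_li = (critical_level - array_mean) / std_error_of_estimate
    return 100.0 * (1.0 - NormalDist().cdf(z_li))


# Hypothetical constants: r = 0.6, s_y = 10, s_x = 3, so a = r * s_y / s_x = 2.
constants = dict(critical_level=47.0, a=2.0, mean_x=6.5,
                 mean_y=50.0, sd_y=10.0, r_xy=0.6)

percentages = [percent_above_critical(x, **constants) for x in range(0, 14)]
for x, pct in enumerate(percentages):
    print(x, round(pct, 1))
```

The test score at which the percentage passes 50 corresponds to the point F of Figures 4 and 5.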

This approach to selecting a cutting score can be quantified if it is 
possible to determine or estimate the ratio of the two quantities: $H$, the 
cost of selecting a person who will fail, and $G$, the net gain from selecting 
a person who will qualify. The cutting score should be at the point 
where for the marginal group the total gain from the successes will 
equal the total cost from the failures, or where 

$$ pG = (1 - p)H. $$

Solving for $p$, we find that 

$$ p = \frac{H}{G + H}. $$

For example, if $G = H$, then, according to this equation, $p = 1/2$. In 
such a case, the cutting score would be at the point F in Figures 4 and 5, 
where the probability of success or failure is 1/2. If the ratio $R = H/G$ 
can be approximated, we have a solution as 

$$ p = \frac{R}{R + 1}. $$

14. Equating two forms of a test by giving them to the same 
group 

It is usually thought that when two forms of a test are given to the 
same group, no special equating problems arise. The procedure is to 
convert each form directly to standard, normalized, or percentile scores, 
and to assume that such scores are comparable since they were obtained 
on the same group for both forms of the test. It should be noted, however, 
that a conversion to standard score makes adjustments only for 
differences in the mean and the standard deviation of the two forms. 
If the skewness or the kurtosis of the two forms differs, this difference 
will be reflected in the standard scores and will also be reflected in most 
cases in percentile or in normalized scores. 

For example, if a distribution of scores is negatively skewed, there 
will be some very low scores, but there will not be corresponding very 
high scores. This will be true of percentile or normalized scores just as 
it will be true of standard scores. With a positively skewed distribution, 
the reverse will be true. We shall get a few scores that are very 
far above the mean, and no scores that are correspondingly far below the 
mean. If a distribution is markedly leptokurtic, some scores will be 
extremely low and others extremely high; whereas for a platykurtic 
distribution the extremely low and high scores will be missing, and instead 
there will be a grouping of the subjects at moderately low and moderately 
high scores. 

This effect of grouping subjects may be illustrated concretely by 
considering the five top ones in a distribution of 200 cases. If all these 
subjects make different scores, they will have percentile scores of 99.75, 
99.25, 98.75, 98.25, and 97.75. If all these subjects make the same 
score, all of them will get a percentile score of 98.75. Such a test cannot 
discriminate between the 97.75 and the 99.75 performance. There is no 
opportunity for even the best person to score higher than 98.75. With 
normalized scores this difference is even greater, since in the first 
instance the highest possible percentile score of 99.75 corresponds to a 
highest possible normalized score of 2.81, whereas, in the second case, 
the highest possible percentile score of 98.75 corresponds to a normalized 
score of 2.24. This is a normalized score difference of 0.57. In the 
central part of the distribution it would take a very much larger percentile 
difference to correspond to a normalized score difference of 0.57. For 
example, a percentile score of 50.00 corresponds to a normalized score 
of 0.00, and a percentile score of 71.57 to a normalized score of 0.57. 
The difference between having one or five persons grouped at the highest 
score changes the highest possible normalized score by as much as the 
difference between the fiftieth and the seventy-first percentile. 
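The figures in this illustration can be checked with a short Python sketch. It assumes the midpoint convention for percentile scores (the percentage below a score plus half the percentage tied at it), which reproduces the values quoted above; the function names are illustrative only.

```python
from statistics import NormalDist


def midpoint_percentiles(scores):
    """Map each distinct score to its midpoint percentile score.

    Percentile = 100 * (number below + half the number tied) / N.
    """
    n = len(scores)
    result = {}
    for s in set(scores):
        below = sum(1 for x in scores if x < s)
        tied = sum(1 for x in scores if x == s)
        result[s] = 100.0 * (below + tied / 2) / n
    return result


def normalized_score(percentile):
    """Convert a percentile score to the corresponding normal deviate."""
    return NormalDist().inv_cdf(percentile / 100.0)


# 200 cases: the lowest 195 scores are distinct in both distributions.
distinct_top = list(range(1, 196)) + [196, 197, 198, 199, 200]
tied_top = list(range(1, 196)) + [196] * 5

best_distinct = midpoint_percentiles(distinct_top)[200]
best_tied = midpoint_percentiles(tied_top)[196]
print(best_distinct, round(normalized_score(best_distinct), 2))
print(best_tied, round(normalized_score(best_tied), 2))
```

Grouping the five top cases at one score lowers the highest attainable percentile score from 99.75 to 98.75, and the highest attainable normalized score by a little more than half a standard deviation.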

If two tests have skewness and/or kurtosis coefficients that are 
radically different, it is difficult to define the meaning of parallel scores 
on the two tests. No set of standard, linear derived scores, percentile, 
or normalized scores will be parallel. 

As yet there is no statistical test available for equality of skewness 
and kurtosis. However, by inspecting the cases at the extremes of the 
distributions involved, it is possible to compare the highest possible and 
the lowest possible scores in two distributions, and to judge whether or 
not the difference is serious in terms of the decisions that are being 
made on the basis of the scores. In particular it is necessary to be care¬ 
ful when special action is being taken on the basis of extremely high or 
extremely low scores, as, for instance, if the best student is awarded a 
scholarship or if a few especially low students are dismissed. 

If three or more parallel forms are being standardized on the same 
group of persons, Wilks’ test for equality of variances and covariances 
given in Chapter 14 may be used. If the tests are not homogeneous with 
respect to covariances, no adjustment of norms can make the forms 
parallel in this respect. 

A second type of case arises if we are standardizing two forms of a 
test on the same group, and a criterion, which the test is to predict, is 
also available. In such a case, we may define the problem of equating 
test scores as matching the regression line of criterion on test for the 
two tests. Let us use the subscript c to designate the criterion, and x 
and y to designate the two tests. Then the problem of equating x and y 
for the purpose of predicting criterion c may be stated as follows. 

Step 1. Check to see that $r_{yc} = r_{xc}$ or that $(1 - r_{yc}^2)s_c^2 = 
(1 - r_{xc}^2)s_c^2$. This check may be made on an approximate judgmental 
basis or by using the extension of Wilks’ criterion given by Votaw 
(1947) or (1948). If the criterion correlations of the two tests (or what 
is the same thing, the criterion errors of estimate) are essentially the 
same, it is possible to equate scores on the two tests, x and y. If the 
criterion correlations are different, equating scores for the purpose of 
predicting the criterion is not possible. 

Step 2. Express both x and y in terms of standard scores or linear 
derived scores, using the same mean and standard deviation. If there 
are slight differences in the criterion correlations of x and y, which may 
be attributed to sampling errors, it is possible to match the regression 
lines exactly by setting the mean of x equal to the mean of y, and making 
the ratio of the standard deviations equal to the ratio of the criterion 
correlations (that is, $s_x/s_y = r_{xc}/r_{yc}$). 
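Step 2 can be checked numerically. The sketch below assumes the usual least-squares slope of criterion on test, r times the criterion standard deviation divided by the test standard deviation; the function names and numbers are hypothetical.

```python
def criterion_slope(r_test_criterion, sd_criterion, sd_test):
    """Least-squares slope of the regression of criterion on test."""
    return r_test_criterion * sd_criterion / sd_test


def matched_sd(sd_x, r_xc, r_yc):
    """Standard deviation to assign to y so that s_x / s_y = r_xc / r_yc."""
    return sd_x * r_yc / r_xc


# Hypothetical values: two forms whose criterion correlations differ slightly.
sd_c, sd_x = 4.0, 10.0
r_xc, r_yc = 0.60, 0.57

sd_y = matched_sd(sd_x, r_xc, r_yc)
slope_x = criterion_slope(r_xc, sd_c, sd_x)
slope_y = criterion_slope(r_yc, sd_c, sd_y)
print(sd_y, slope_x, slope_y)
```

With equal means and this ratio of standard deviations, the two regression lines of criterion on test have the same slope and the same intercept.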

Kelley (1947), pages 364-365, describes a method of establishing 
norms for a new test ($X_0$) in terms of an “anchor test” ($X_1$) that is based 
on the use of the regression of $X_0$ on $X_1$. If this method is used to 
determine equivalent scores for two parallel forms of a test, a systematic 
bias will result. As compared with the anchor test, the new test will 
have a smaller unit of measurement, and hence the numerical value of 
its standard deviation will necessarily be larger than that of the anchor 
test. 

In summary, when two forms of a test are given to the same group, 
converting each form directly to standard, normalized, or percentile 
scores does not necessarily and automatically result in comparable 
scores for the two forms. It is necessary first to see that the skewness 
and kurtosis coefficients are similar for the two tests. If the skewness 
and kurtosis are similar, standard, normalized, or percentile scores are 
comparable. If either or both of these coefficients are different, standard 
scores will certainly not be comparable, and because of the possibilities 
of different groupings of scores at the extremes of the distribution, 
percentile or normalized scores are also likely not to be comparable. 

If in addition to the two forms of the test, criterion scores are available 
on the standardization group, and the purpose of the test is to 
predict the criterion, it is necessary first to be certain that the criterion 
correlations of the two forms are similar, and then to use standard 
scores or some linear derived score for the two forms. Means and 
standard deviations or else means and regression slopes should be 
equated for the two forms. 



Chap. 19] Methods of Standardizing and Equating Scores 299 

15. Equating two forms of a test given to different groups 

A more complex and also more usual case of equating arises when 
form Y (given to group A) is to be equated to form Z (given to group B). 
This is usually done by means of another test or segment of a test that 
will be designated X, which is administered to both groups. The theory 
for equating test Y given to group A with test Z given to group B by 
means of test X given to both groups has been developed by Ledyard 
Tucker (unpublished manuscript) for the case of standardized or linear 
derived scores. 

The equating “test” X mentioned above may be a single test or subtest, 
yielding only one score, or it may be that several equating variables 
($X_g$, $g = 1, \dots, K$) will be available. We shall first consider the case 
where only one equating variable is available (g = 1), and then the more 
general case where K equating variables are available. 

Since standard scores or linear derived scores are dependent entirely 
on mean and variance, the problem may be stated as that of estimating 
the mean and variance of test Y for group B. This mean and variance 
would then be arbitrarily assigned to the new test Z that was given to 
group B. The Z-norms would thus be comparable to the Y-norms in 
the sense that the mean and variance of the transformed Z-scores for 
group B would be the same as they would have been if test Y had been 
used on group B. 

Let us use a subscript set in roman type to designate the group, a 
bar over the variable to designate the mean, and a wavy line to designate 
the standard deviation. In this notation, we may say that the 
problem is to estimate $\bar{Y}_{\mathrm{B}}$ and $\tilde{Y}_{\mathrm{B}}$ (the mean and standard deviation of Y 
for group B) from the known items of information, $\bar{X}_{\mathrm{A}}$, $\tilde{X}_{\mathrm{A}}$, $\bar{X}_{\mathrm{B}}$, $\tilde{X}_{\mathrm{B}}$, 
$\bar{Y}_{\mathrm{A}}$, and $\tilde{Y}_{\mathrm{A}}$ (the mean and standard deviation of Y for group A and of 
X for groups A and B). 

Making use of the equation of the regression line and the deviation 
score notation, we may say that 

(11)  $y_i = a x_i + e_i.$

The score of the ith person on test y is equal to a times his deviation 
score on test x plus an error, $e_i$. We may change to gross scores by 
substituting $Y_i - \bar{Y}$ for $y_i$, $X_i - \bar{X}$ for $x_i$, and write equation 11 
explicitly for each of the groups A and B, as follows: 

(12)  $Y_{\mathrm{A}i} = a_{\mathrm{A}} X_{\mathrm{A}i} + \bar{Y}_{\mathrm{A}} - a_{\mathrm{A}} \bar{X}_{\mathrm{A}} + e_{\mathrm{A}i}$

and 

(13)  $Y_{\mathrm{B}i} = a_{\mathrm{B}} X_{\mathrm{B}i} + \bar{Y}_{\mathrm{B}} - a_{\mathrm{B}} \bar{X}_{\mathrm{B}} + e_{\mathrm{B}i}.$

Since complete information is available on group A, equation 12 presents 
no problem. However, for group B only the X-scores are available; 
hence some estimates must be made regarding the constants in 
equation 13. It seems reasonable to assume that the slope and intercept 
of the regression of Y on X for group B are equal, respectively, to the 
slope and intercept of the same regression for group A, that is, 

(14)  $a_{\mathrm{A}} = a_{\mathrm{B}}$

and 

(15)  $\bar{Y}_{\mathrm{A}} - a_{\mathrm{A}} \bar{X}_{\mathrm{A}} = \bar{Y}_{\mathrm{B}} - a_{\mathrm{B}} \bar{X}_{\mathrm{B}}.$

Summing equation 13 and dividing by $N_{\mathrm{B}}$ to obtain the mean of $Y_{\mathrm{B}i}$, 
we have 

(16)  $\bar{Y}_{\mathrm{B}} = a_{\mathrm{B}} \bar{X}_{\mathrm{B}} + \bar{Y}_{\mathrm{B}} - a_{\mathrm{B}} \bar{X}_{\mathrm{B}} + \bar{e}_{\mathrm{B}}.$

If we assume that 

(17)  $\bar{e}_{\mathrm{A}} = \bar{e}_{\mathrm{B}} = 0,$

and substitute equations 14, 15, and 17 in equation 16, we obtain 

(18)  $\bar{Y}_{\mathrm{B}} = \bar{Y}_{\mathrm{A}} + a_{\mathrm{A}} (\bar{X}_{\mathrm{B}} - \bar{X}_{\mathrm{A}}).$

Equation 18 expresses the Y-mean for group B in terms of known 
quantities. 

    The value $\bar{Y}_{\mathrm{B}}$ given by equation 18 is the arbitrary mean to 
    be assigned to the B-group in order to have the scores comparable 
    with the Y-scores of the A-group. This is the value 
    to be used as $M_w$ in equation 6. Equation 18 is derived from 
    assumptions that are the same as those used in the equations 
    for group heterogeneity in Chapters 10 to 13. 


To obtain the variance of $Y_{\mathrm{B}i}$, we write equation 13 in deviation score 
form as in equation 11, and take the sum of the squares of the deviations 
over $N_{\mathrm{B}}$, obtaining 

(19)  $\dfrac{\sum_{i=1}^{N} y_{\mathrm{B}i}^2}{N} = \dfrac{\sum_{i=1}^{N} (a_{\mathrm{B}} x_{\mathrm{B}i} + e_{\mathrm{B}i})^2}{N}.$

Since the correlation between x and e is zero, $\sum xe$ is zero. Expanding 
the right side of the equation, and writing $\tilde{Y}_{\mathrm{B}}^2$ for the variance of Y 
and $\tilde{E}^2$ for the error variance, we obtain 

(20)  $\dfrac{\sum_{i=1}^{N} y_{\mathrm{B}i}^2}{N} = a_{\mathrm{B}}^2 \dfrac{\sum_{i=1}^{N} x_{\mathrm{B}i}^2}{N} + \dfrac{\sum_{i=1}^{N} e_{\mathrm{B}i}^2}{N}$

or 

(21)  $\tilde{Y}_{\mathrm{B}}^2 = a_{\mathrm{B}}^2 \tilde{X}_{\mathrm{B}}^2 + \tilde{E}_{\mathrm{B}}^2.$

Likewise, for the A group we have 

(22)  $\tilde{Y}_{\mathrm{A}}^2 = a_{\mathrm{A}}^2 \tilde{X}_{\mathrm{A}}^2 + \tilde{E}_{\mathrm{A}}^2.$

From equation 22 we can solve for $\tilde{E}_{\mathrm{A}}^2$ in terms of $\tilde{Y}_{\mathrm{A}}$ and $\tilde{X}_{\mathrm{A}}$. If we 
make the assumption that 

(23)  $\tilde{E}_{\mathrm{A}}^2 = \tilde{E}_{\mathrm{B}}^2,$

we may write $\tilde{Y}_{\mathrm{B}}^2$ entirely in terms of known quantities as 

(24)  $\tilde{Y}_{\mathrm{B}}^2 = \tilde{Y}_{\mathrm{A}}^2 + a_{\mathrm{A}}^2 (\tilde{X}_{\mathrm{B}}^2 - \tilde{X}_{\mathrm{A}}^2).$

Equation 24 expresses the Y-variance for group B in terms of known 
quantities. 

    The value $\tilde{Y}_{\mathrm{B}}$ given by equation 24 is the arbitrary standard 
    deviation to be assigned to the B-group in order to have the 
    scores comparable with the Y-scores of the A-group. This is 
    the value to be used as $s_w$ in equation 6. Equation 24 is 
    derived from assumptions that are the same as those used in 
    the equations for group heterogeneity in Chapters 10 to 13. 

Equation 24 is identical with equation 20 of Chapter 11. 
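Equations 18 and 24 translate directly into a small routine. The following Python sketch is illustrative only: the function name and the numerical values are hypothetical, and the slope for group A is taken as the usual least-squares value, the correlation times the ratio of the Y to the X standard deviation.

```python
import math


def equate_single_anchor(mean_y_a, sd_y_a, mean_x_a, sd_x_a, r_xy_a,
                         mean_x_b, sd_x_b):
    """Estimate the mean and standard deviation of Y for group B.

    Uses one equating variable X given to both groups, with the slope
    of the regression of Y on X computed from group A.
    """
    slope_a = r_xy_a * sd_y_a / sd_x_a                    # a_A
    mean_y_b = mean_y_a + slope_a * (mean_x_b - mean_x_a)  # equation 18
    var_y_b = sd_y_a ** 2 + slope_a ** 2 * (sd_x_b ** 2 - sd_x_a ** 2)  # equation 24
    if var_y_b < 0:
        raise ValueError("estimated variance is negative; check the inputs")
    return mean_y_b, math.sqrt(var_y_b)


# Hypothetical summary statistics for groups A and B.
mean_b, sd_b = equate_single_anchor(
    mean_y_a=50.0, sd_y_a=10.0,
    mean_x_a=20.0, sd_x_a=5.0, r_xy_a=0.8,
    mean_x_b=22.0, sd_x_b=6.0,
)
print(mean_b, sd_b)
```

Because group B is both higher and more variable than group A on the equating test in this example, both the estimated mean and the estimated standard deviation of Y for group B exceed the group A values.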

If K equating variables are available, the derivation of the Y mean 
and variance for group B follows the same general pattern, except that 
a multiple regression equation is used instead of the regression line of 
equation 11. To correspond to equation 11 for the multiple-regression 
case, we write 

(25)  $y_i = \sum_{g=1}^{K} a_g x_{ig} + e_i.$

From equation 25, the equations corresponding to equations 12 and 13 
are written as 

(26)  $Y_{\mathrm{A}i} = \sum_{g=1}^{K} a_{\mathrm{A}g} X_{\mathrm{A}ig} + \bar{Y}_{\mathrm{A}} - \sum_{g=1}^{K} a_{\mathrm{A}g} \bar{X}_{\mathrm{A}\cdot g} + e_{\mathrm{A}i}$

and 

(27)  $Y_{\mathrm{B}i} = \sum_{g=1}^{K} a_{\mathrm{B}g} X_{\mathrm{B}ig} + \bar{Y}_{\mathrm{B}} - \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g} + e_{\mathrm{B}i}.$

To obtain the mean $\bar{Y}_{\mathrm{B}}$, sum equation 27 and divide by $N_{\mathrm{B}}$, obtaining 

(28)  $\bar{Y}_{\mathrm{B}} = \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g} - \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g} + \bar{Y}_{\mathrm{B}} + \bar{e}_{\mathrm{B}}.$

To correspond to equations 14 and 15, we assume that the regression 
coefficients, $a_g$, for group B are equal to those for group A, and that the 
constant term for group A is equal to that for group B. These assumptions 
give the K + 1 equalities, 

(29)  $a_{\mathrm{A}g} = a_{\mathrm{B}g} \quad (g = 1, \dots, K)$

and 

(30)  $\bar{Y}_{\mathrm{A}} - \sum_{g=1}^{K} a_{\mathrm{A}g} \bar{X}_{\mathrm{A}\cdot g} = \bar{Y}_{\mathrm{B}} - \sum_{g=1}^{K} a_{\mathrm{B}g} \bar{X}_{\mathrm{B}\cdot g}.$

Substituting equations 17, 29, and 30 in equation 28 gives 

(31)  $\bar{Y}_{\mathrm{B}} = \bar{Y}_{\mathrm{A}} + \sum_{g=1}^{K} a_{\mathrm{A}g} (\bar{X}_{\mathrm{B}\cdot g} - \bar{X}_{\mathrm{A}\cdot g}).$

    For the general case of K equating variables, equation 31 
    gives the Y-mean for group B. This is the value to be used 
    for $M_w$ in equation 6. The derivation uses the same 
    assumptions as those of Chapters 10 to 13. 

In order to obtain the Y-variance for group B, write equation 27 in 
deviation score form as 

(32)  $Y_{\mathrm{B}i} - \bar{Y}_{\mathrm{B}} = \sum_{g=1}^{K} a_{\mathrm{B}g} (X_{\mathrm{B}ig} - \bar{X}_{\mathrm{B}\cdot g}) + e_{\mathrm{B}i}.$

Using the lower-case symbols to designate deviation scores gives 

(33)  $y_{\mathrm{B}i} = \sum_{g=1}^{K} a_{\mathrm{B}g} x_{\mathrm{B}ig} + e_{\mathrm{B}i}.$

To obtain N times the variance of y, square both sides of equation 33, 
and sum. Noting that all terms of the form $\sum_{i=1}^{N} x_{ig} e_i$ are zero, we may 
write 

(34)  $\sum_{i=1}^{N} y_{\mathrm{B}i}^2 = \sum_{i=1}^{N} \left[ \sum_{g=1}^{K} a_{\mathrm{B}g} x_{\mathrm{B}ig} \right]^2 + \sum_{i=1}^{N} e_{\mathrm{B}i}^2.$

The first term on the right side of the equation may be expressed as a 
triple summation, and the order of summation may be altered, giving 

(35)  $\sum_{i=1}^{N} y_{\mathrm{B}i}^2 = \sum_{g=1}^{K} \sum_{h=1}^{K} \sum_{i=1}^{N} a_{\mathrm{B}g} a_{\mathrm{B}h} x_{\mathrm{B}ig} x_{\mathrm{B}ih} + \sum_{i=1}^{N} e_{\mathrm{B}i}^2.$

Equation 35 may be simplified by the notation 

(36)  $\sum_{i=1}^{N} x_{ig} x_{ih} = N c_{gh}.$

If $g \neq h$, c is a covariance. If g = h, c is a variance. For variables y 
and e, the sum of squares will be designated by $N\tilde{Y}^2$ and $N\tilde{E}^2$, 
respectively. Introducing these notational changes in equation 35 and 
dividing by N gives 

(37)  $\tilde{Y}_{\mathrm{B}}^2 = \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{B}g} a_{\mathrm{B}h} c_{\mathrm{B}gh} + \tilde{E}_{\mathrm{B}}^2.$

Likewise, for group A, we have 

(38)  $\tilde{Y}_{\mathrm{A}}^2 = \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{A}g} a_{\mathrm{A}h} c_{\mathrm{A}gh} + \tilde{E}_{\mathrm{A}}^2,$

from which the variance of $E_{\mathrm{A}}$ may be written as 

(39)  $\tilde{E}_{\mathrm{A}}^2 = \tilde{Y}_{\mathrm{A}}^2 - \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{A}g} a_{\mathrm{A}h} c_{\mathrm{A}gh}.$

Using the assumption of equation 23, we may substitute equation 39 
in equation 37; then using the assumptions of equation 29 and simplifying 
the result, we find the solution 

(40)  $\tilde{Y}_{\mathrm{B}}^2 = \tilde{Y}_{\mathrm{A}}^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{A}g} a_{\mathrm{A}h} (c_{\mathrm{B}gh} - c_{\mathrm{A}gh}),$

where $\tilde{Y}_{\mathrm{B}}^2$ is the variance of variable Y for group B, 
      $\tilde{Y}_{\mathrm{A}}^2$ is the variance of variable Y for group A, 
      $a_{\mathrm{A}g}$ is the regression weight for variable $X_g$ in predicting Y in 
      group A, and 
      $c_{\mathrm{A}gh}$ is the covariance $(1/N) \sum_{i=1}^{N} x_{\mathrm{A}ig} x_{\mathrm{A}ih}$. 

    For the general case of K equating variables, equation 40 
    gives the Y-variance for group B. This is the value to be 
    used for $s_w$ in equation 6. The derivation uses the same 
    assumptions as those of Chapters 10 to 13. 
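The general case with K equating variables can be sketched in the same way. In this hypothetical Python illustration the regression weights for group A and the variance-covariance matrices of the equating variables in each group are supplied as plain lists; the function name and all values are assumptions made for the example.

```python
def equate_multi_anchor(mean_y_a, var_y_a, weights_a,
                        means_x_a, means_x_b, cov_x_a, cov_x_b):
    """Estimate the mean and variance of Y for group B from K anchors.

    weights_a[g] is the group A regression weight for anchor g;
    cov_x_a[g][h] and cov_x_b[g][h] are the variances (g == h) and
    covariances (g != h) of the anchors in groups A and B.
    """
    k = len(weights_a)
    # Mean of Y for group B (the K-variable analogue of the single-anchor mean).
    mean_y_b = mean_y_a + sum(
        weights_a[g] * (means_x_b[g] - means_x_a[g]) for g in range(k)
    )
    # Variance of Y for group B (double sum over anchor pairs).
    var_y_b = var_y_a + sum(
        weights_a[g] * weights_a[h] * (cov_x_b[g][h] - cov_x_a[g][h])
        for g in range(k)
        for h in range(k)
    )
    return mean_y_b, var_y_b


# Hypothetical two-anchor example.
mean_b, var_b = equate_multi_anchor(
    mean_y_a=100.0, var_y_a=400.0, weights_a=[1.0, 2.0],
    means_x_a=[0.0, 0.0], means_x_b=[1.0, 1.0],
    cov_x_a=[[1.0, 0.0], [0.0, 1.0]],
    cov_x_b=[[2.0, 0.5], [0.5, 1.0]],
)
print(mean_b, var_b)
```

With a single anchor (K = 1) the routine reduces to the one-variable formulas, which gives a convenient check on any implementation.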

The problem discussed in this section is equating test Z, given to 
group B, to test Y, given to group A, by means of an equating test X or 
a set of K equating tests $X_g$. The solution is to estimate the mean of Y 
for group B by equation 18 or 31, and to estimate the variance of Y for 
group B by equation 24 or 40. These estimated values are then used 
as the arbitrary mean and variance ($M_w$ and $s_w^2$) to be assigned to the 
new test Z (see equations 5 and 6). This equating procedure is 
appropriate for any linear derived scores. 

For non-linear transformations of gross scores no appropriate proce¬ 
dure has yet been suggested. For percentile scores, it may well be 
impossible to develop a rigorous equating method. For normalized 
scores, it may be that some adaptation of Thurstone’s absolute scaling 
methods will give a satisfactory solution. These normalized scores 
might then furnish a satisfactory basis for equating percentile scores. 
If and when a solution for the equating problem is developed for normalized 
and percentile scores, it is highly likely that the sampling errors 
involved will be very much greater than those found in equating on the 
basis of linear derived scores. If the magnitude of the sampling errors 
involved in equating tests from one group to another is considered, it 
seems likely that linear derived scores have a distinct advantage over 
the non-linear transformations. 

16. Summary 

After the test papers have been scored, the next step is to assess the 
gross score distribution in terms of the average chance score (K/A) and 
the variability of chance scores 

(1)  $s_c = \dfrac{1}{A} \sqrt{K(A - 1)},$

where K is the number of items in the test and A is the number of 
alternatives per item. The lowest score taken as indicating knowledge of 
the subject should be greater than $K/A + 2s_c$. 
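The chance-score rule is simple to compute. The Python sketch below is a minimal illustration with a hypothetical function name; it returns the average chance score, its standard deviation, and the lowest whole-number score that exceeds the average by more than two standard deviations.

```python
import math


def chance_score_summary(n_items, n_alternatives):
    """Return (mean chance score, chance SD, lowest score above mean + 2 SD)."""
    mean_chance = n_items / n_alternatives
    sd_chance = math.sqrt(n_items * (n_alternatives - 1)) / n_alternatives
    # Smallest integer strictly greater than K/A + 2 * s_c.
    lowest_score = math.floor(mean_chance + 2 * sd_chance) + 1
    return mean_chance, sd_chance, lowest_score


# A 20-item true-false test and a 100-item test with 5 alternatives per item.
print(chance_score_summary(20, 2))
print(chance_score_summary(100, 5))
```

For the 100-item, five-alternative test the chance mean is 20 and the chance standard deviation is 4, so scores of 28 or below are not clearly distinguishable from guessing.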

Whenever the distribution is divided into groups, the distance from 
the lower bound to the upper bound of a group should be large with re¬ 
spect to the error of measurement of the test. 

It should be noted that, whenever several tests are used, the principle 
of successive hurdles makes for raising of passing standards, whereas 
permitting multiple trials lowers standards, particularly with unreliable 
tests. 

In converting gross scores to an arbitrarily specified scale, as in Navy 
grades, Civil Service ratings, and some college grading systems, a good 
procedure is to determine certain critical points (such as the lowest 
passing grade and the lowest honors grade) by expert judgment, and 
then use linear interpolation between these points. 

In most large-scale testing programs, the procedure is to convert 
gross scores to some predetermined scale that indicates the relative 
standing of the individual in his group, and to report scores in terms of 
this scale. The transformations usually considered are: 

1. Linear transformations, termed standard scores, or linear derived 
   scores. 

2. Non-linear transformations, of which the commonest are the 
   percentile score and the normalized score. 

Percentile scores represent the percentage of persons in a typical 
group scoring less than the person in question. Such scores are very 
easy to explain to persons unacquainted with testing, but they have so 
many disadvantages that percentile scores are not generally used except 
as auxiliary scores. Percentile differences or averages are not constant 
in meaning from the middle to the extreme of the scale, and the equating 
of different groups is difficult if not impossible. Normalized scores 
should in general not be used unless there is some good reason for 
believing that the underlying distribution of ability is normal and is 
misrepresented by the distribution of gross scores. Thurstone’s Absolute 
Scaling Methods furnish one way of checking on this belief for two 
partly overlapping groups given the same test. The range of possible 
normalized scores is also sensitive to the number of cases in the group, 
and to the grouping of the extreme cases. This effect must be watched 
carefully if comparisons are to be made from group to group or from 
test to test. 

The various disadvantages of percentile and normalized scores have 
led to the general use of some linear transformation of gross scores, with 
a convenient scale specified by an arbitrary mean and standard deviation. 
For standard scores the computation equation is 

(4)  $z_i = \dfrac{X_i - M_x}{s_x},$

where $z_i$ is the standard score of the ith individual, 
      $X_i$ is the gross score of the same individual, and 
      $M_x$ and $s_x$ are the mean and standard deviation of the gross score 
      distribution. 

For other types of linear derived scores the computing equation is 

(6)  $w_i = \dfrac{s_w}{s_x} X_i + M_w - \dfrac{s_w}{s_x} M_x,$

where $w_i$ is the linear derived score of individual i, and 
      $M_w$ and $s_w$ are the arbitrarily specified mean and standard deviation 
      of the linear derived scores. 

The other terms have the definitions given for equation 4. 
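Both linear transformations can be sketched briefly in Python. The example assumes that the standard deviation is computed with N in the denominator (statistics.pstdev), in keeping with the sums of squares divided by N used in this chapter; the function names and the eight gross scores are hypothetical.

```python
from statistics import fmean, pstdev


def standard_scores(gross_scores):
    """Convert gross scores to standard scores (mean 0, SD 1)."""
    m_x = fmean(gross_scores)
    s_x = pstdev(gross_scores)
    return [(x - m_x) / s_x for x in gross_scores]


def linear_derived_scores(gross_scores, target_mean, target_sd):
    """Convert gross scores to a scale with the specified mean and SD."""
    m_x = fmean(gross_scores)
    s_x = pstdev(gross_scores)
    ratio = target_sd / s_x
    return [ratio * x + target_mean - ratio * m_x for x in gross_scores]


# Hypothetical gross scores with mean 5 and standard deviation 2.
gross = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_scores(gross))
print(linear_derived_scores(gross, target_mean=50, target_sd=10))
```

A linear derived score is simply the standard score multiplied by the chosen standard deviation and shifted to the chosen mean, so the two scales order and space individuals identically.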
The use of normalized scores referred to some standard group has 
been suggested by McCall (the T-score), by Flanagan (the Cooperative 
Test Scaled Scores), and by Thurstone (absolute scaling methods). 
Thurstone gives the fundamental scaling equation as 

(9)  $M_{Va} + Y_{ia} s_{Va} = M_{Vb} + Y_{ib} s_{Vb}$

or 

(10)  $Y_{ia} = \left( \dfrac{s_{Vb}}{s_{Va}} \right) Y_{ib} + \dfrac{M_{Vb} - M_{Va}}{s_{Va}},$

where $M_{Va}$ and $M_{Vb}$ are the means in hypothetical absolute units for 
      distributions a and b, 
      $s_{Va}$ and $s_{Vb}$ are the standard deviations in hypothetical absolute 
      units for distributions a and b, and 
      $Y_{ia}$ and $Y_{ib}$ are the normalized scores for distributions a and b, 
      respectively. 

Equation 10 demonstrates that the normalized scores for two distributions 
will be linearly related to each other, if both distributions can 
be regarded as normal on the same scale. 

In standardizing to predict a criterion performance it is necessary to 
use the regression of criterion on test score and to give the corresponding 
error of estimate in order to use the norms properly. 

If, in addition to the regression of criterion on test score, we have a 
specified critical criterion level, it is possible from these two items of 
information to draw a curve showing the percentage of persons (p) 
above the critical level at each test score range. This graph can be 
used for determining the cutting score. If the ratio of H, the cost of 
selecting a potential failure, to G, the net gain due to selecting a 
successful person, can be determined or estimated, the cutting score can be 
fixed at the test score level where p = R/(R + 1), R = H/G. 

Another type of standardization seeks to indicate the age or grade 
placement of the person. In such norms it is necessary to know which 
regression line has been used, and to know the nature of the sampling 
used for selecting the standardization group. It is also important to 
know the size of the error of measurement in relation to the size of the 
crucial steps in the norms. Without such facts as these, the degree of 
over- or underachievement of a person can easily be markedly exag¬ 
gerated or minimized. 

If two forms of a test to be equated have been given to the same 
group, it is possible to make two independent transformations to some 
linear or non-linear scores. Such scores for the two forms, however, 
will not be comparable unless the skewness and kurtosis coefficients for 
the gross score distributions are similar. Also if we are standardizing 
in terms of a criterion, two forms cannot be regarded as parallel unless 
the correlation with the criterion is approximately the same for the 
two forms. 

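In standardizing two forms of a test, each given to a separate group, 
it is necessary to use some form of linear derived scores and to have a 
matching test. When linear derived scores are used, the only problem 
is to determine an appropriate mean and standard deviation for the 
second group. For a single equating variable, we have 

(18)  $\bar{Y}_{\mathrm{B}} = \bar{Y}_{\mathrm{A}} + a_{\mathrm{A}} (\bar{X}_{\mathrm{B}} - \bar{X}_{\mathrm{A}})$

and 

(24)  $\tilde{Y}_{\mathrm{B}}^2 = \tilde{Y}_{\mathrm{A}}^2 + a_{\mathrm{A}}^2 (\tilde{X}_{\mathrm{B}}^2 - \tilde{X}_{\mathrm{A}}^2).$

For K equating variables $X_g$ ($g = 1, \dots, K$), we have 

(31)  $\bar{Y}_{\mathrm{B}} = \bar{Y}_{\mathrm{A}} + \sum_{g=1}^{K} a_{\mathrm{A}g} (\bar{X}_{\mathrm{B}\cdot g} - \bar{X}_{\mathrm{A}\cdot g})$

and 

(40)  $\tilde{Y}_{\mathrm{B}}^2 = \tilde{Y}_{\mathrm{A}}^2 + \sum_{g=1}^{K} \sum_{h=1}^{K} a_{\mathrm{A}g} a_{\mathrm{A}h} (c_{\mathrm{B}gh} - c_{\mathrm{A}gh}),$

where $\bar{Y}_{\mathrm{A}}$ and $\tilde{Y}_{\mathrm{A}}^2$ are the mean and variance of Y for group A. 
      (These are the original scores to which the B-group scores are 
      to be matched.) 
      $\bar{X}_{\mathrm{A}}$, $\tilde{X}_{\mathrm{A}}^2$, $\bar{X}_{\mathrm{B}}$, and $\tilde{X}_{\mathrm{B}}^2$ are the mean and variance of X for 
      groups A and B, respectively. (X is the matching test that has 
      been given to both groups A and B. Also $X_g$, $g = 1, \dots, K$, 
      indicates K matching tests.) 
      $a_{\mathrm{A}g}$ is the regression weight for variable $X_g$ in predicting Y in 
      group A. 
      $c_{\mathrm{A}gh}$ is $(1/N) \sum_{i=1}^{N} x_{\mathrm{A}ig} x_{\mathrm{A}ih}$ ($x_{\mathrm{A}ig} = X_{\mathrm{A}ig} - \bar{X}_{\mathrm{A}\cdot g}$). (If 
      $g \neq h$, this term is a covariance; if g = h, this term is a 
      variance.) 


Table for Use in Connection with Problems 3 to 7 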

The following scores were made by 68,899 students in 323 colleges on the 1937 
edition of the American Council on Education Psychological Examination for 
College Freshmen. 

                              Frequency
Scores                 Men       Women     Total *
0-9                                  2           4
10-19                    5           3          12
20-29                   27          22          58
30-39                   85          50         170
40-49                  169         112         329
50-59                  329         225         626
60-69                  471         358         943
70-79                  667         479       1,314
80-89                  923         769       1,915
90-99                1,108         892       2,264
100-109              1,387       1,171       2,897
110-119              1,669       1,376       3,429
120-129              1,768       1,529       3,764
130-139              2,064       1,768       4,348
140-149              2,113       1,793       4,471
150-159              2,188       1,830       4,650
160-169              2,220       1,748       4,600
170-179              2,128       1,798       4,583
180-189              1,990       1,610       4,207
190-199              1,823       1,479       3,904
200-209              1,639       1,351       3,593
210-219              1,488       1,251       3,281
220-229              1,234         996       2,686
230-239              1,097         906       2,441
240-249                893         748       2,025
250-259                750         596       1,630
260-269                584         488       1,309
270-279                474         329         998
280-289                358         273         772
290-299                284         187         580
300-309                184         122         387
310-319                153          74         286
320-329                 96          52         181
330-339                 70          38         133
340-349                 29          13          51
350-359                 24           9          40
360-369                  6           3          14
370-379                  2                       2
380-389                  1                       2

Total               32,500      26,450      68,899

Lower quartile      127.27      127.54      128.67
Median              165.75      164.84      167.08
Upper quartile      207.57      206.10      208.87

M = 170.0214 
s = 57.7012 

* The total includes the scores of 9949 students not classified according to sex. 
Data taken from L. L. Thurstone and T. G. Thurstone, The 1937 Psychological 
Examination for College Freshmen, The Educational Record, April, 1938, pages 
209-234. 

Problems 

1. Draw a graph corresponding to the following transformation. The distribution 
of Table 1 is to be linearly transformed to a scale 00-99, such that a gross score 
of 150 equals 70 (the lowest passing mark) and a gross score of 210 equals 90 (the 
lowest honors mark). 

2. Give the average chance score ($c$), the variance of scores due to chance ($s_c^2$), 
and the lowest gross score exceeding $c + 2s_c$ for each of the following tests: 

(a) A 20-item true-false test. 

(b) A 100-item true-false test. 

(c) A 20-item multiple-choice test that has one correct and four incorrect 
alternatives for each item. 

(d) A 100-item multiple-choice test, with 5 alternatives per item, as in c. 

(e) A 10-item test, with 10 alternatives for each item. 

3. Using only the total distribution given in the right-hand column of the 
foregoing table, compute the table, and draw the graph for transforming raw scores of the 
foregoing frequency distribution to (a) standard scores (z-scores), (b) linear derived 
scores, with a mean of 50 and a standard deviation of 10 (w-scores), (c) percentile 
scores (p-scores), (d) normalized scores (n-scores). 

4. From the information in the preceding problem, draw the graphs showing the 
relationship between (a) z-scores and w-scores, (b) z-scores and p-scores, (c) z-scores 
and n-scores, (d) w-scores and p-scores, (e) w-scores and n-scores, (f) p-scores and 
n-scores. 

Write a brief paragraph stating the relationships shown in the foregoing six graphs. 

5. As a check on normality use arithmetic probability paper and plot (a) p-scores 
against n-scores, (b) p-scores against w-scores. 

6. Using the distribution for men only as given in the foregoing frequency 
distribution, compute the table and draw the graph for transforming raw scores of men 
to (a) z-scores, (b) w-scores, with mean 50 and standard deviation 10, (c) p-scores, 
(d) n-scores. 

7. Using the distribution for women only as given in the foregoing frequency 
distribution, compute the table and draw the graph for transforming raw scores of 
women to (a) z-scores, (b) w-scores, with a mean of 50 and standard deviation 10, 
(c) p-scores, (d) n-scores. 

Write a brief paragraph comparing the norms for men with those for women. 

8. Below is given the frequency distribution of A.C.E. scores for 113 students 
taking the 56 tests used in Dr. Thurstone’s first large study of primary mental 
abilities. (Thurstone, 1938, page 19.) This distribution is given in terms of percentile 
points on the national norms for the A.C.E. test. The table shows that there was 
one student between the 35 and 40 percentile points on the national norms; two 
students between the 45 and 50 percentile points; and so forth. It will be noticed 
that over 25 per cent of the students are above the 98 percentile point on the national 
norms. Can this distribution of 113 cases be regarded as a normal distribution, 
granted the assumption of a Gaussian distribution of intelligence in the 40,000 
students on whom the national norms were based? Use the absolute scaling methods 
to answer this question. 

                                Cumulative
Scores *        Frequency       Frequency
35-40                1               1
40-45                1               2
45-50                2               4
50-55                2               6
55-60                4              10
60-65                6              16
65-70                2              18
70-75                4              22
75-80               10              32
80-85                6              38
85-90               10              48
90-95               25              73
95-96                3              76
96-97                4              80
97-98                3              83
98-99                6              89
99-99.9             23             112
99.9-100.0           1             113

* National norms (1933), 40,229 cases, 203 colleges. 


9. 

           Criterion (y)          Selection Test (x)                  Minimum
                  Standard                Standard                    Acceptable
Set     Mean      Deviation     Mean      Deviation      r_xy         y-score
A        9.11       2.95         9.78       3.12          .74           6.00
B       30.42       9.25        20.48       6.75          .55          20.00

For each of the foregoing sets of data, graph p, the percentage of acceptable 
y-scores, as a function of x, the selection test score. Determine the cutting score 
appropriate for each set of data on the assumption that: 

(a) The net gain due to selecting an acceptable student is equal to the loss incurred 
by selecting one who will fail. 

(b) The gain is double the loss. 

(c) The loss is double the gain. 

10 . Form A of a spelling test was given to a sample of 1000 eighth-grade students 
in 1942. The mean number correct was 49.1, and the standard deviation was 13.6. 
This test was standardized in terms of derived scores with a mean of 500 and a stand¬ 
ard deviation of 100. In 1946, form B of the spelling test was given to a sample of 
1000 eighth-grade students. The mean was 61.3, and the standard deviation 14.8. 




An equating test, form C, had been given to both groups of students when tests 
A and B were given, with the following results. For the second group, the mean 
was 23.4, and the standard deviation 6.8; for the first group, the mean was 21.3, the 
standard deviation 6.2, and the correlation $r_{AC}$ was .75. It is desired to use linear 
derived scores for form B that will give scores directly comparable with the derived 
scores (mean = 500, standard deviation = 100) used for form A. 

(a) Write the transformation equation used on the 1942 group for form A. 

(b) Write the transformation equation that should be used with the 1946 group 
for form B. 

11. In 1944 a group of 1000 college freshmen were given two mathematics and one 
vocabulary test with the following results: 



              Mean      Standard      Correlations
                        Deviation
Math. A      137.6        15.8        r_AC = .81
Math. C       48.1         7.3        r_AD = .63
Voc. D       206.4        25.7        r_CD = [illegible]

In 1947 another group of 1000 college freshmen were given two mathematics tests 
and one vocabulary test. The vocabulary test and the mathematics test C were 
identical with the tests given to the 1944 group. For the 1947 group, the results
were as follows:

              Mean      Standard      Correlations
                        Deviation
Math. B      172.7        21.4        r_BC = .85
Math. C       53.6         8.5        r_BD = .61
Voc. D       213.2        28.9        r_CD = .49


Test A has been converted to a linear derived score with a mean of 100 and a 
standard deviation of 20. 

(a) In order to make scores on test B comparable with the linear derived scores 
for test A, what arbitrary mean and standard deviation should be used for 
transforming test B? Use both tests C and D for equating. 

(b) Write the transformation for test A. 

(c) Write the appropriate computing equation to use for transforming scores on 
test B. 

12. One of the equations presented in connection with the discussion of the influence 
of group heterogeneity (see Chapters 10, 11, and 12) is analogous to one of the equa¬ 
tions in this chapter. Find and compare these two equations. 



20 

Problems of Weighting 
and Differential Prediction 


1. General considerations in determining weights 

When several test scores are available, on the basis of which a decision 
is to be made, we have the problem of the appropriate method of com¬ 
bining these scores. When a single total score is to be derived from a 
number of measures, this score should represent the standing of the 
candidates with respect to something. The type of judgment involved 
in determining what this something should be and various methods of 
combining scores will be considered in this chapter. 

It should be noted that it is not possible to dodge the weighting 
problem if any decisions are to be made. Occasionally we hear the 
suggestion that scores simply be added together without bothering 
about problems of weighting. No matter what scores we add, the weight¬ 
ing problem is not avoided. Adding the gross scores on a series of tests 
gives relative weights of one sort, adding standard scores gives relative 
weights of a different type. What information must be obtained, and 
what major questions must be answered in order to secure reasonable 
composite scores from pooling the components? 
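The point that simple addition still implies a set of weights can be checked numerically. The following is a minimal sketch using invented scores for five candidates (the data and the helper names `standardize` and `ranks` are assumptions for illustration, not material from the text). It shows that a sum of gross scores equals a sum of standard scores weighted by each test's standard deviation, and that the two kinds of sum can rank candidates differently.

```python
from statistics import mean, pstdev

# Hypothetical gross scores of five candidates on two tests (illustrative only).
test_a = [10, 12, 14, 16, 18]        # small spread
test_b = [100, 150, 120, 180, 130]   # large spread

def standardize(scores):
    """Convert scores to standard scores (zero mean, unit standard deviation)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def ranks(values):
    """Rank position of each value, 0 being the lowest."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, i in enumerate(order):
        result[i] = position
    return result

raw_sum = [a + b for a, b in zip(test_a, test_b)]
std_sum = [a + b for a, b in zip(standardize(test_a), standardize(test_b))]

# Adding gross scores weights each standard score by that test's standard deviation.
sd_a, sd_b = pstdev(test_a), pstdev(test_b)
print("implicit weights when adding gross scores:", sd_a, sd_b)
print("ranking from gross-score sum:   ", ranks(raw_sum))
print("ranking from standard-score sum:", ranks(std_sum))
```

In this invented example test B has roughly ten times the standard deviation of test A, so the gross-score sum is dominated by test B, and two of the five candidates change places when standard scores are added instead.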

It has also been suggested that a separate cutting score may be deter¬ 
mined for each test so that we should use a combination of cutting scores 
instead of a weighted score. Franzen (1943) has presented a type of 
“multiple chi” procedure for determining the best combination of cutting 
scores. This procedure consists essentially in trying all possible com¬ 
binations of various cutting scores, and then using the one that turns 
out to be best for the set of data in hand. Systematic short-cut compu¬ 
tational procedures are also presented by Franzen. 

In connection with multiple-cutting scores, it must be noted that 
policy changes that seem slight may in effect produce a marked difference 
in standards. In Figure 1 we see that, if a person must pass both tests 
to be accepted, only group 2 will be accepted. If passing either test is 
acceptable, groups 1, 2, and 3 will be accepted. It should also be noted
that, if those who fail the first time are allowed to try a second and a
third time, this policy is equivalent to saying that the person is acceptable
if he passes either test; hence many more will pass.



Figure 1. Illustrating the difference between passing both and passing either test.

The difference between multiple-cutting scores and a weighted com¬ 
posite may be seen by referring to Figure 2. In this figure the cutting 
scores are adjusted so that the same number will be passed by each 
policy. If both tests must be passed, we accept those above and to the 
right of line abc. If passing either test is acceptable, the person must be 
above or to the right of line def. In the first case, persons in areas 1, 4, 
and 5 are accepted; those in areas 2, 3, 6, 7, and 8 are rejected. In the 
second case those in areas 1, 2, 3, 6, and 7 are accepted; those in areas 
4, 5, and 8 are rejected. If the number of persons in area 4, plus the 
number in area 5, is equal to the total number in areas 2, 3, 6, and 7, 
the same number will be accepted by either system. Likewise, the 
number rejected will be the same for either system. The use of a 
weighted score is illustrated by the line gh. In using this line we accept 
those in areas 1, 2, 4, and 6, while rejecting those in areas 3, 5, 7, and 8. 
By all three methods everyone in area 1 is accepted, and everyone in
area 8 is rejected. The methods differ only in the disposition of persons
with closely similar scores in the areas 2, 3, 4, 5, 6, and 7. It may be
noted that the use of multiple-cutting scores results in putting areas 
4 and 5 (with difference scores near zero) together as either accepted 
or rejected. Also persons in areas 2, 3, 6, and 7 (with large positive 
or negative difference scores) are classed together as rejected or accepted. 
Thus the use of multiple-cutting scores would be justified by a curvilinear
relationship between the criterion and the difference score. If
this relationship is linear rather than curvilinear, the use of a straight
line such as gh would be more appropriate than a multiple-cutting score.
Since in general linear relationships have been found adequate for most
test work, we shall limit ourselves in this chapter to a discussion of
weighting in terms of various linear combinations of scores.

Figure 2. Illustrating the difference between multiple cutting scores and a weighted composite.
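The three acceptance policies compared above can be illustrated with a small sketch. The candidate scores, the cutting scores, and the function names below are all hypothetical, chosen so that each policy accepts the same number of candidates, as in the comparison of lines abc, def, and gh.

```python
# Hypothetical standard scores (test 1, test 2) for six candidates.
candidates = {
    "P": (1.2, 1.1),    # high on both tests
    "Q": (1.5, -0.9),   # high on test 1 only
    "R": (-0.8, 1.6),   # high on test 2 only
    "S": (0.4, 0.5),    # moderate on both tests
    "T": (-0.2, 0.1),   # near average on both tests
    "U": (-1.3, -1.0),  # low on both tests
}

def pass_both(x1, x2, cut1, cut2):
    """Accepted only if both cutting scores are reached."""
    return x1 >= cut1 and x2 >= cut2

def pass_either(x1, x2, cut1, cut2):
    """Accepted if at least one cutting score is reached."""
    return x1 >= cut1 or x2 >= cut2

def pass_composite(x1, x2, w1, w2, cut):
    """Accepted if the weighted sum reaches a single cutting score."""
    return w1 * x1 + w2 * x2 >= cut

# Cutting scores are assumed values, set so that each policy accepts three people.
both = {n for n, (a, b) in candidates.items() if pass_both(a, b, -0.25, -0.25)}
either = {n for n, (a, b) in candidates.items() if pass_either(a, b, 1.0, 1.0)}
composite = {n for n, (a, b) in candidates.items() if pass_composite(a, b, 1, 1, 0.7)}

print("both tests:  ", sorted(both))
print("either test: ", sorted(either))
print("composite:   ", sorted(composite))
```

With these assumed figures every policy accepts P and rejects U. The policies differ only on the intermediate candidates: the "both" rule favors those with near-equal scores, the "either" rule favors those with one outstanding score, and the composite falls between the two.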

The first problem is to find the best way of describing a set of weights. 
It will be found that the ratio of the standard deviation of the distribu¬ 
tion of weights to the mean of the distribution is the best single number 
for characterizing a set of weights. The relationship between two sets 
of weights is represented by their covariance or correlation. It is important
first to note that, if the two sets of weights being considered are
similar to each other, the two weighted composites will correlate highly
with each other. For example, if tests A, B, and C receive weights 
1, 2, and 3, respectively, or 1, 2, and 4, respectively, the resulting com¬ 
posite scores will be very similar. The weights 3, 4, and 5 will give 
essentially the same results as 2, 4, and 6. Unless we are considering 
radically different sets of weights, the resulting scores cannot be altered 
much by changing from one set of weights to the other. If the two sets 
of weights have a low intercorrelation, the correlation between the 
composites will be determined by the ratio of the standard deviation 
of the distribution of the weights to the mean of the distribution, and 
also by the properties of the test battery, such as the number of tests 
combined and the correlation between these tests.
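The claim about similar sets of weights can be checked with the expression for the correlation between two weighted sums of standard scores, which is derived later in this chapter as equation 7. The sketch below is only an illustration: the intercorrelation of .3 among tests A, B, and C is an assumed value, and `composite_correlation` is a hypothetical helper name.

```python
from math import sqrt

def composite_correlation(v, w, r):
    """Correlation between sum(v_g z_g) and sum(w_g z_g) for standard scores z
    whose intercorrelation matrix is r (unit diagonal)."""
    k = len(v)
    cross = sum(v[g] * w[h] * r[g][h] for g in range(k) for h in range(k))
    var_v = sum(v[g] * v[h] * r[g][h] for g in range(k) for h in range(k))
    var_w = sum(w[g] * w[h] * r[g][h] for g in range(k) for h in range(k))
    return cross / sqrt(var_v * var_w)

# Assumed intercorrelations among tests A, B, and C (not from the text).
r = [[1.0, 0.3, 0.3],
     [0.3, 1.0, 0.3],
     [0.3, 0.3, 1.0]]

print(composite_correlation([1, 2, 3], [1, 2, 4], r))  # similar weights
print(composite_correlation([3, 4, 5], [2, 4, 6], r))  # similar weights
print(composite_correlation([1, 2, 3], [3, 2, 1], r))  # negatively correlated weights
```

Under the assumed intercorrelations the first two pairs of composites correlate above .99, while reversing the order of the weights, a radically different set, lowers the correlation to roughly .86.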

It will be found that, if the standard deviation of the set of weights is 
very large in comparison to the mean, changes in weights used can 
produce great changes in scores regardless of the number of variables 
to be combined, and regardless of their intercorrelation. For example, 
if both positive and negative weights are permitted, the mean of the 
distribution of weights will be near zero, while the standard deviation 
will be very large. If freedom of this type is allowed in weighting, two 
composites may have very low correlations regardless of the number of 
variables combined and regardless of the intercorrelation of these 
variables. However, if the mean of the distribution of weights is about 
equal to or larger than the standard deviation of the distribution of 
weights and if the correlation between the two sets of weights is low, 
the correlation between the two composites will depend largely upon the 
number of variables involved and upon the intercorrelations of these 
variables. This case is important, since it is the usual one found in the 
weighting of items to give a total test score, and of tests in an aptitude 
battery to give a composite score. Limiting our consideration to sets of 
positive weights with low intercorrelations, we find that the composites 
will not be different unless there are relatively few variables to be com¬ 
bined and a low correlation among these variables. 

Thus we have seen that in considering the effect of weighting on a 
composite the test battery may be characterized by two variables: 
(1) the average intercorrelation between the tests and (2) the number 
of tests. The weights likewise may be characterized by two variables: 
(1) the ratio of the standard deviation to the mean of the distribution 
of weights and (2) the correlation between the two sets of weights. In 
order to demonstrate the effect of each of these factors on the correlation 
between the two composites, it is necessary to write an expression for 
the correlation between two weighted sums.

Let us consider a set of standard scores $z$ and two different sets of
weights designated by $V$ and $W$. We have

(1)  $X_{Vi} = V_1 z_{1i} + V_2 z_{2i} + \cdots + V_K z_{Ki} = \sum_{g=1}^{K} V_g z_{gi}.$

The composite score ($X_V$) for individual $i$ is equal to the sum of the
products of $z$-scores for that individual, each multiplied by the assigned
weight ($V_g$). In like manner we may write another composite $X_W$,
obtained by applying a different set of weights ($W_g$) to the same set of
standard scores, namely,

(2)  $X_{Wi} = W_1 z_{1i} + W_2 z_{2i} + \cdots + W_K z_{Ki} = \sum_{g=1}^{K} W_g z_{gi}.$

The composite score ($X_W$) for individual $i$ is the sum from $g = 1$ to $K$
of the products of the $z$-scores for that individual, each multiplied by
the assigned weight ($W_g$). In order to indicate the influence of two
different sets of weights, we shall write the correlation between $X_V$
and $X_W$,

(3)  $R_{X_V X_W} = \dfrac{\sum_{i=1}^{N} X_{Vi} X_{Wi}}{\sqrt{\sum_{i=1}^{N} X_{Vi}^2} \, \sqrt{\sum_{i=1}^{N} X_{Wi}^2}}.$

It should be noted that, since the $z$-scores have a zero mean, the $X$-scores
will also have a zero mean; hence the gross score formula for correlation
need not be used.

Substituting equations 1 and 2 in the numerator of equation 3 and 
expanding, we have 

(4)  $\begin{aligned} \sum X_V X_W = {} & V_1 W_1 \sum z_1^2 + V_2 W_1 \sum z_1 z_2 + \cdots + V_K W_1 \sum z_1 z_K \\ & + V_1 W_2 \sum z_1 z_2 + V_2 W_2 \sum z_2^2 + \cdots + V_K W_2 \sum z_2 z_K + \cdots \\ & + V_1 W_K \sum z_1 z_K + V_2 W_K \sum z_2 z_K + \cdots + V_K W_K \sum z_K^2, \end{aligned}$

where it is understood that all summations are over individuals
($i = 1, \ldots, N$). Since the $z$'s are standard scores, $\sum z_g^2 = N$ and
$\sum z_g z_h = N r_{gh}$ ($g \neq h$). If we make these substitutions, indicating first
the sum of the $K$ diagonal terms and then the $K^2 - K$ non-diagonal
terms, we have

(5)  $\sum_{i=1}^{N} X_{Vi} X_{Wi} = \sum_{g=1}^{K} V_g W_g N + \sum_{g \neq h = 1}^{K} V_g W_h r_{gh} N.$

It must be noted that there are $K$ terms in the first summation and
$K^2 - K$ terms in the second summation. By substituting $V$ for $W$
in equation 5, we may write the first factor in the denominator of
equation 3 as follows:

(6)  $\sum_{i=1}^{N} X_{Vi}^2 = \sum_{g=1}^{K} V_g^2 N + \sum_{g \neq h = 1}^{K} V_g V_h r_{gh} N.$

By substituting $W$ for $V$ in equation 6, we may obtain an expression
for the second factor in the denominator of equation 3. Substituting
equations 5 and 6 in equation 3 and factoring $N$ out of both numerator
and denominator, we have

(7)  $R_{X_V X_W} = \dfrac{\sum_{g=1}^{K} V_g W_g + \sum_{g \neq h = 1}^{K} V_g W_h r_{gh}}{\sqrt{\sum_{g=1}^{K} V_g^2 + \sum_{g \neq h = 1}^{K} V_g V_h r_{gh}} \; \sqrt{\sum_{g=1}^{K} W_g^2 + \sum_{g \neq h = 1}^{K} W_g W_h r_{gh}}}.$

Again it must be remembered that the single-subscript summations
contain $K$ terms, whereas the summations involving both $g$ and $h$ contain
the $K^2 - K$ non-diagonal terms. Let us now consider the numerator
term of equation 7. We may use $C_{VW}$ to designate the covariance
between the two sets of weights and write

(8)  $C_{VW} = \dfrac{1}{K} \sum_{g=1}^{K} V_g W_g - \bar{V} \bar{W},$

where $\bar{V}$ is the mean of the $V$'s and $\bar{W}$ is the mean of the $W$'s. Solving
equation 8 for $\sum VW$, we have

(9)  $\sum_{g=1}^{K} V_g W_g = K(C_{VW} + \bar{V} \bar{W}).$

We may also introduce the concept of the covariance of $r_{gh}$ with the
product $V_g W_h$, designated by $C_{(VW)r}$. This term will in general have a
lower bound of zero and an upper bound equal to the product of the
standard deviation of $r$ and the standard deviation of $VW$. We are
limiting ourselves here to the conventional case in which highly intercorrelated
parts are given more weight than those with low intercorrelations.

Following the form of equation 9, we may write

(10)  $\sum_{g \neq h = 1}^{K} V_g W_h r_{gh} = (K^2 - K)\left[C_{(VW)r} + (\overline{VW})\,\bar{r}\right].$

Referring to equation 4, we see that the term $(\overline{VW})$ is the mean of only
the non-diagonal product terms of the form $V_g W_h$. Summing the $VW$
terms by columns gives

$V_1 \sum W + V_2 \sum W + \cdots + V_K \sum W = (\sum V)(\sum W) = K\bar{V} K\bar{W}.$

To obtain $(\overline{VW})$ we deduct the sum of the terms in the principal diagonal
and divide by $K^2 - K$, obtaining

(11)  $(\overline{VW}) = \dfrac{K\bar{V} K\bar{W} - \sum V_g W_g}{K^2 - K}.$

Substituting equation 9 in equation 11, and then the rewritten equation
11 in equation 10 and rearranging terms, we have

(12)  $\sum_{g \neq h = 1}^{K} V_g W_h r_{gh} = (K^2 - K)(C_{(VW)r} + \bar{V}\bar{W}\bar{r}) - K C_{VW} \bar{r}.$

Combining equations 9 and 12, we may write the numerator of equation
7 as follows:

(13)  $\sum_{g=1}^{K} V_g W_g + \sum_{g \neq h = 1}^{K} V_g W_h r_{gh} = K(1 - \bar{r})(C_{VW} + \bar{V}\bar{W}) + (K^2 - K) C_{(VW)r} + K^2 \bar{V}\bar{W}\bar{r},$

where $\bar{r}$ is the average intercorrelation of the subtests,
$\bar{V}$ is the average of the $V$-weights,
$\bar{W}$ is the average of the $W$-weights,
$K$ is the number of scores to be combined,
$C_{VW}$ is the covariance between the two sets of weights, and
$C_{(VW)r}$ is the covariance between $r_{gh}$ and the product $V_g W_h$.

By substituting $V$ for $W$ in equation 13, we may write the first factor
in the denominator of equation 7 as follows:

(14)  $\sum_{g=1}^{K} V_g^2 + \sum_{g \neq h = 1}^{K} V_g V_h r_{gh} = K(1 - \bar{r})(\tilde{V}^2 + \bar{V}^2) + (K^2 - K) C_{(VV)r} + K^2 \bar{V}^2 \bar{r}.$

The covariance term $C_{VW}$, it should be noted, changes into the variance
of $V$, which has been designated by the symbol $\tilde{V}^2$. By substituting
$W$ for $V$ in equation 14, an expression can be written for the second term
in the denominator of equation 7. Substituting equations 13 and 14
in equation 7, we obtain the final expression for the correlation between
the two weighted sums,

(15)  $R_{X_V X_W} = \dfrac{K(1 - \bar{r})(C_{VW} + \bar{V}\bar{W}) + (K^2 - K) C_{(VW)r} + K^2 \bar{V}\bar{W}\bar{r}}{\sqrt{K(1 - \bar{r})(\tilde{V}^2 + \bar{V}^2) + (K^2 - K) C_{(VV)r} + K^2 \bar{V}^2 \bar{r}} \; \sqrt{K(1 - \bar{r})(\tilde{W}^2 + \bar{W}^2) + (K^2 - K) C_{(WW)r} + K^2 \bar{W}^2 \bar{r}}}.$

This equation expresses the correlation between two composites obtained
by using different weights, in terms of

$K$, the number of scores to be combined,
$\bar{W}$ and $\bar{V}$, the averages of the $W$ and $V$ weights,
$\tilde{W}$ and $\tilde{V}$, the standard deviations of the two sets of weights,
$\bar{r}$, the average intercorrelation of the scores to be combined,
$C_{VW}$, the covariance between the two sets of weights used, and three
terms of the form
$C_{(VW)r}$, the covariance of a product of weights with $r_{gh}$.


Horst (1941), pages 379-401, contains a discussion by M. W. Richardson
of the principles to be followed and the precautions to be observed
in deciding upon a set of weights. Richardson presents an equation
analogous to equation 15, but derived from more restrictive assumptions.
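As a check on the algebra leading from equation 7 to equation 15, the two expressions can be evaluated for the same weights and intercorrelations. The sketch below assumes that the means, variances, and covariances are taken with divisor $K$ for the weights and divisor $K^2 - K$ for the off-diagonal terms, as in equations 8 to 11. The numerical values and the function names are hypothetical.

```python
from math import sqrt

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def offdiag_pairs(k):
    """Index pairs (g, h) with g != h: the K^2 - K non-diagonal cells."""
    return [(g, h) for g in range(k) for h in range(k) if g != h]

def cov_weights_r(a, b, r):
    """C_(ab)r: covariance of a_g * b_h with r_gh over the non-diagonal cells."""
    pairs = offdiag_pairs(len(a))
    prod_mean = mean(a[g] * b[h] for g, h in pairs)
    r_mean = mean(r[g][h] for g, h in pairs)
    return mean(a[g] * b[h] * r[g][h] for g, h in pairs) - prod_mean * r_mean

def term(a, b, r):
    """K(1 - rbar)(C_ab + abar*bbar) + (K^2 - K) C_(ab)r + K^2 abar*bbar*rbar."""
    k = len(a)
    a_bar, b_bar = mean(a), mean(b)
    c_ab = mean(x * y for x, y in zip(a, b)) - a_bar * b_bar  # variance when a is b
    r_bar = mean(r[g][h] for g, h in offdiag_pairs(k))
    return (k * (1 - r_bar) * (c_ab + a_bar * b_bar)
            + (k * k - k) * cov_weights_r(a, b, r)
            + k * k * a_bar * b_bar * r_bar)

def r_equation_15(v, w, r):
    return term(v, w, r) / sqrt(term(v, v, r) * term(w, w, r))

def r_equation_7(v, w, r):
    k = len(v)
    def quad(a, b):
        return sum(a[g] * b[h] * r[g][h] for g in range(k) for h in range(k))
    return quad(v, w) / sqrt(quad(v, v) * quad(w, w))

# Assumed intercorrelations of four tests and two assumed sets of weights.
r = [[1.0, 0.5, 0.3, 0.2],
     [0.5, 1.0, 0.4, 0.3],
     [0.3, 0.4, 1.0, 0.6],
     [0.2, 0.3, 0.6, 1.0]]
v = [1, 2, 3, 4]
w = [2, 1, 4, 3]

print(r_equation_7(v, w, r), r_equation_15(v, w, r))
```

The two printed values agree to within rounding error, which is what the derivation requires, since equation 15 is only a rearrangement of equation 7.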

To see what happens as $K$ increases, we may divide the numerator
and denominator by $K^2$ and omit all terms which have $(1/K)$ as a factor.
This gives

(16)  $\dfrac{C_{(VW)r} + \bar{V}\bar{W}\bar{r}}{\sqrt{C_{(VV)r} + \bar{V}^2 \bar{r}} \; \sqrt{C_{(WW)r} + \bar{W}^2 \bar{r}}},$

which is equal to unity if the covariance terms are equal and the mean
$V$-weight equals the mean $W$-weight. In particular, if the covariance
terms are near zero they may be ignored; and in this case $R_{X_V X_W}$
approaches unity regardless of the value of the mean weights.

Also we learn from equation 15 that, if the average intercorrelation
of the items ($\bar{r}$) is near unity, the factor $(1 - \bar{r})$ approaches zero, and
equation 15 approaches an expression similar to that given in equation
16. That is, when $\bar{r}$ in equation 15 approaches unity, $R_{X_V X_W}$ approaches
unity if the covariance terms are equal and if $\bar{V} = \bar{W}$. It is also true
in this case that $R_{X_V X_W}$ approaches unity if $\bar{r}$ approaches unity and the
covariance terms approach zero, regardless of the values of $\bar{V}$ and $\bar{W}$.

It should also be noted that, if positive, zero, and negative weights
in any combination are allowed, either or both $\bar{V}$ and $\bar{W}$ may be zero,
and $\tilde{V}$ and $\tilde{W}$ may be either large or small in relation to finite values
of $\bar{V}$ and $\bar{W}$. If such freedom in selection of weights is allowed, we can
learn from equation 15 and from the limit given in equation 16 that
$R_{X_V X_W}$ may assume any value, regardless of the intercorrelation of the
weights, the number of variables, or the average intercorrelation of
these variables.

If all the weights cannot be positive, we are considering the situation
in which $\bar{V}$ and $\bar{W}$ are small in relation to $\tilde{V}$ and $\tilde{W}$. Assuming that the
terms containing $\bar{V}$ or $\bar{W}$ may be ignored as being small, we see that
$R_{X_V X_W}$ depends primarily on the four covariance terms $C_{VW}$, $C_{(VW)r}$,
$C_{(VV)r}$, and $C_{(WW)r}$. The number of variables to be combined ($K$) and
the average intercorrelation of these variables ($\bar{r}$) are of only minor
importance in determining $R_{X_V X_W}$ when $\bar{V}$ and $\bar{W}$ are both small.

Let us examine equation 15 to see what happens as the correlation
between the two sets of weights increases. Under this condition, the
term $C_{VW}$ approaches $\tilde{V}\tilde{W}$, and the covariance $C_{(VW)r}$ becomes similar
to $C_{(VV)r}$ and $C_{(WW)r}$ so that, as $r_{VW}$ approaches unity, the value of
$R_{X_V X_W}$ is dependent upon the value of $\bar{V}$ and $\bar{W}$. Thus we see that
as $r_{VW}$ approaches unity, $R_{X_V X_W}$ also approaches unity, provided that
$\bar{V} = \bar{W}$.

We may also see that, as the standard deviations of $V$ and of $W$ are
decreased, the covariance and variance terms in equation 15 decrease.
In the limit these terms will vanish. Dividing the numerator and the
denominator of the remaining terms by $\bar{V}\bar{W}$, we find that $R_{X_V X_W}$
approaches unity as the variance of the weights is decreased, provided
that the terms $\bar{V}$ and $\bar{W}$ do not approach zero.

Summarizing the information furnished by equation 15 and
the limit given in equation 16, we see that:

A. If either or both $\bar{V}$ and $\bar{W}$ may be zero, $R_{X_V X_W}$ may
assume any value regardless of the value of $\bar{r}$, $K$, or the
various covariance terms involving the weights.

B. If $\bar{V}$ and $\bar{W}$ are small in relation to $\tilde{V}$ and $\tilde{W}$, $R_{X_V X_W}$
depends primarily on the four covariance terms $C_{VW}$,
$C_{(VW)r}$, $C_{(VV)r}$, and $C_{(WW)r}$, and is relatively insensitive
to changes in the values of $\bar{r}$ and $K$.

C. If we consider only positive weights so that $\tilde{V}/\bar{V}$ and
$\tilde{W}/\bar{W}$ are less than unity, the correlation between the two
composites obtained by using two different sets of weights
approaches unity as (a) the correlation between the two sets
of weights is increased, (b) the average intercorrelation of
the part scores is increased, and (c) the number of scores
to be combined is increased. It should be particularly
noted that this last effect holds, even if the correlation
between the two sets of weights ($r_{VW}$) is zero, but that $\bar{r}$ must
be greater than zero. (d) As the standard deviation of the
weights ($\tilde{V}$ and $\tilde{W}$) is decreased in proportion to the mean
weights ($\bar{V}$ and $\bar{W}$), $R_{X_V X_W}$ approaches unity regardless of
the values of $\bar{r}$, $\bar{V}$, and $\bar{W}$.

From the practical point of view, we may say that, if a large number 
of scores are to be combined (of the order of 50 to 100) or if the scores 
have high intercorrelations, it makes relatively little difference what sets 
of positive weights are assigned. The computationally simplest set 
would probably be the best one to use. If, however, we are combining 
only a few scores (for example, three to ten), and the average intercorre¬ 
lation is low (.5 or less), differential weighting equations may profitably 
be considered. However, the set of weights must have a large standard
deviation if it is to give results appreciably different from the set 1,
1, ..., 1; also if two sets of weights have a high intercorrelation it makes
little or no difference which set is used.

Wilks (1938) has dealt with the special case in which the weights are
distributed so that $\tilde{V}/\bar{V}$ and $\tilde{W}/\bar{W}$ are each less than unity, and the weights
are independent of each other and of the correlations between the variables to
be weighted. It should be noted that the usual practice in weighting 
items in a test is to use positive weights so distributed that the standard 
deviation will not be large relative to the mean weight. Furthermore, 
in considering alternative sets of weights, the two sets considered are 
usually positively correlated so that the case dealing with weights inde¬ 
pendent of each other, which is considered by Wilks, will give composites 
that correlate less than the alternative composites usually considered in 
practice. It is also important to remember that, if we are willing to 
consider two sets of positive weights that are negatively correlated with 
each other, the correlation between the resulting composites will be 
lower than that indicated by Wilks’ formula, given as equation 47 in 
this chapter. This formula may be derived in the following manner. 

If a sample of $K$ weights ($V_g$) is drawn from an infinite population
with mean $a_v$ and standard deviation $\sigma_v$, the variable

$e_v = \dfrac{(\bar{V} - a_v)\sqrt{K}}{\sigma_v}$

has a zero mean and unit variance regardless of the magnitude of $K$
(the number in the sample). (See Wilks, 1943, page 81.) The mean
of a given sample may thus be expressed as

$\bar{V} = a_v + \dfrac{e_v \sigma_v}{\sqrt{K}}.$

The sample mean is thus expressed as a function of the population mean
and standard deviation, the number of cases in the sample, and a
variable ($e_v$) with zero mean and unit standard deviation. The standard
deviation of $e_v$ is thus independent of the number of cases in the sample.

Since we have limited the treatment to the case in which the weights
$V$ are independent of $W$, the mean of the distribution of products $VW$
for the entire population sampled will be equal to the product of the
means of the two populations ($a_v a_w$). Hence we may write

(17)  $\dfrac{\sum_{g=1}^{K} V_g W_g}{K} = a_v a_w + \dfrac{e_1}{\sqrt{K}},$

where $e_1$ is a random variable with zero mean and a constant standard
deviation independent of $K$.

Likewise the mean of the distribution of products $VWr$ for the entire
population of weights and correlations will be equal to the product of
the means ($a_v a_w \bar{r}$). Thus we have

(18)  $\dfrac{\sum_{g \neq h = 1}^{K} V_g W_h r_{gh}}{K^2 - K} = a_v a_w \bar{r} + \dfrac{e_2}{\sqrt{K^2 - K}},$

where $e_2$ is a random variable with zero mean and a standard deviation
independent of $K$.

Similarly, if we use $b_v$ to designate the second moment about the origin
for the population of weights $V$,

(19)  $\dfrac{\sum_{g=1}^{K} V_g^2}{K} = b_v + \dfrac{e_3}{\sqrt{K}}.$

Correspondingly, the mean of the product $V_g V_h r_{gh}$ may be written

(20)  $\dfrac{\sum_{g \neq h = 1}^{K} V_g V_h r_{gh}}{K^2 - K} = a_v^2 \bar{r} + \dfrac{e_4}{\sqrt{K^2 - K}}.$

Substituting $W$ for $V$ in the two foregoing equations, we obtain appropriate
expressions for the $W$-weights, with the random variables $e_5$ and $e_6$
taking the places of $e_3$ and $e_4$.

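Substituting equations 17, 18, 19, 20, and analogous equations for
the weights $W$ in equation 7, we have

(21)  $R_{X_V X_W} = \dfrac{K\left[a_v a_w + \frac{e_1}{\sqrt{K}}\right] + (K^2 - K)\left[a_v a_w \bar{r} + \frac{e_2}{\sqrt{K^2 - K}}\right]}{\sqrt{K\left[b_v + \frac{e_3}{\sqrt{K}}\right] + (K^2 - K)\left[a_v^2 \bar{r} + \frac{e_4}{\sqrt{K^2 - K}}\right]} \; \sqrt{K\left[b_w + \frac{e_5}{\sqrt{K}}\right] + (K^2 - K)\left[a_w^2 \bar{r} + \frac{e_6}{\sqrt{K^2 - K}}\right]}}.$

In order to determine the factors influencing the composite as $K$ becomes
large, we shall define a new variable,

$y = \dfrac{1}{\sqrt{K}}.$

$R$ may then be regarded as a function of $y$ and expanded in Taylor's
series about $y = 0$. If we set $\sqrt{K} = 1/y$ in equation 21, multiply
numerator and denominator by $y^4$, and define three new functions,
$G(y)$, $H(y)$, and $F(y)$, we obtain

(22)  $R_{X_V X_W} = G(y) = \dfrac{H(y)}{\sqrt{F_v(y) F_w(y)}},$

where

(23)  $H(y) = a_v a_w y^2 + e_1 y^3 + a_v a_w \bar{r}(1 - y^2) + e_2 y^2 \sqrt{1 - y^2},$

(24)  $F_v(y) = b_v y^2 + e_3 y^3 + a_v^2 \bar{r}(1 - y^2) + e_4 y^2 \sqrt{1 - y^2},$

and

(25)  $F_w(y) = b_w y^2 + e_5 y^3 + a_w^2 \bar{r}(1 - y^2) + e_6 y^2 \sqrt{1 - y^2}.$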

We are now regarding $R_{X_V X_W}$ as a function of the variable $y$, and the
parameters $a$, $b$, and $\bar{r}$. The problem is to evaluate this function in the
vicinity of $y = 0$. Using Taylor's theorem for small values of $y$, we
may write

(26)  $G(y) = G(0) + y\,G'(0) + \dfrac{y^2}{2}\,G''(0).$

Setting $y = 0$ in equation 22, we see that

(27)  $G(0) = 1.$

The problem is now to evaluate $G'(0)$ and $G''(0)$.

To evaluate $G'(0)$, we differentiate equation 22, obtaining

(28)  $G'(y) = \dfrac{H'(y)}{\sqrt{F_v(y) F_w(y)}} - \dfrac{G(y)}{2}\left[\dfrac{F'_v(y)}{F_v(y)} + \dfrac{F'_w(y)}{F_w(y)}\right].$

To determine the value of equation 28, when $y = 0$, we need the derivatives
of equations 23, 24, and 25. These derivatives are

(29)  $H'(y) = 2a_v a_w y + 3e_1 y^2 - 2a_v a_w \bar{r} y - e_2 \dfrac{3y^3 - 2y}{\sqrt{1 - y^2}},$

(30)  $F'_v(y) = 2b_v y + 3e_3 y^2 - 2a_v^2 \bar{r} y - e_4 \dfrac{3y^3 - 2y}{\sqrt{1 - y^2}},$

and

(31)  $F'_w(y) = 2b_w y + 3e_5 y^2 - 2a_w^2 \bar{r} y - e_6 \dfrac{3y^3 - 2y}{\sqrt{1 - y^2}}.$

Setting $y = 0$ in equations 28 to 31 inclusive, we find that

(32)  $F'_v(0) = F'_w(0) = H'(0) = G'(0) = 0.$

To evaluate $G''(y)$, when $y = 0$, we differentiate equation 28, obtaining

(33)  $\begin{aligned} G''(y) = {} & \dfrac{H''(y)}{\sqrt{F_v(y) F_w(y)}} - \dfrac{H'(y)}{2\sqrt{F_v(y) F_w(y)}}\left[\dfrac{F'_v(y)}{F_v(y)} + \dfrac{F'_w(y)}{F_w(y)}\right] - \dfrac{G'(y)}{2}\left[\dfrac{F'_v(y)}{F_v(y)} + \dfrac{F'_w(y)}{F_w(y)}\right] \\ & - \dfrac{G(y)}{2}\left[\dfrac{F''_v(y)}{F_v(y)} - \left(\dfrac{F'_v(y)}{F_v(y)}\right)^2 + \dfrac{F''_w(y)}{F_w(y)} - \left(\dfrac{F'_w(y)}{F_w(y)}\right)^2\right]. \end{aligned}$

Let us set $y = 0$ and substitute equations 27 and 32 in equation 33,
obtaining

(34)  $G''(0) = \dfrac{H''(0)}{\sqrt{F_v(0) F_w(0)}} - \dfrac{F''_v(0)}{2F_v(0)} - \dfrac{F''_w(0)}{2F_w(0)}.$

The terms in the denominator of equation 34 may be evaluated from equations
24 and 25. The terms of the numerator, however, require the
derivatives of equations 29, 30, and 31. These derivatives are

(35)  $H''(y) = 2a_v a_w + 6e_1 y - 2a_v a_w \bar{r} + e_2 \dfrac{6y^4 - 9y^2 + 2}{(1 - y^2)^{3/2}},$

(36)  $F''_v(y) = 2b_v + 6e_3 y - 2a_v^2 \bar{r} + e_4 \dfrac{6y^4 - 9y^2 + 2}{(1 - y^2)^{3/2}},$

(37)  $F''_w(y) = 2b_w + 6e_5 y - 2a_w^2 \bar{r} + e_6 \dfrac{6y^4 - 9y^2 + 2}{(1 - y^2)^{3/2}}.$

Setting $y = 0$, we have

(38)  $H''(0) = 2a_v a_w - 2a_v a_w \bar{r} + 2e_2,$

(39)  $F''_v(0) = 2b_v - 2a_v^2 \bar{r} + 2e_4,$

(40)  $F''_w(0) = 2b_w - 2a_w^2 \bar{r} + 2e_6.$

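Substituting equations 38, 39, and 40, as well as equations 24 and 25
in equation 34, we have

(41)  $G''(0) = \dfrac{2}{\bar{r}} - 2 + \dfrac{2e_2}{a_v a_w \bar{r}} - \left[\dfrac{b_v + e_4}{a_v^2 \bar{r}} + \dfrac{b_w + e_6}{a_w^2 \bar{r}} - 2\right].$

If we factor out $-1/\bar{r}$ in equation 41, and rearrange the terms, we have

(42)  $G''(0) = -\dfrac{1}{\bar{r}}\left[\dfrac{b_v}{a_v^2} - 1 + \dfrac{b_w}{a_w^2} - 1 - \dfrac{2e_2}{a_v a_w} + \dfrac{e_4}{a_v^2} + \dfrac{e_6}{a_w^2}\right].$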

From the gross score formula for variance, we see that

(43)  $\sigma_v^2 = b_v - a_v^2$

and

(44)  $\sigma_w^2 = b_w - a_w^2.$

Substituting equations 43 and 44 in equation 42, we have

(45)  $G''(0) = -\dfrac{1}{\bar{r}}\left[\dfrac{\sigma_v^2}{a_v^2} + \dfrac{\sigma_w^2}{a_w^2} - \dfrac{2e_2}{a_v a_w} + \dfrac{e_4}{a_v^2} + \dfrac{e_6}{a_w^2}\right].$

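Substituting equations 27, 32, and 45 in equation 26 and setting
$y^2 = 1/K$, we have

(46)  $R_{X_V X_W} \approx 1 - \dfrac{1}{2\bar{r}K}\left[\dfrac{\sigma_v^2}{a_v^2} + \dfrac{\sigma_w^2}{a_w^2} - \dfrac{2e_2}{a_v a_w} + \dfrac{e_4}{a_v^2} + \dfrac{e_6}{a_w^2}\right].$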


If we consider the expectation of $R$, the variables designated by $e_i$
will vanish, since each of these variables was defined in such a way as
to have a mean of zero and a constant standard deviation. Designating
the mean value of $R$ by $\bar{R}$, we have the final formula given by Wilks
(1938), page 26:

(47)  $\bar{R} = 1 - \dfrac{1}{2\bar{r}K}\left[\dfrac{\sigma_v^2}{a_v^2} + \dfrac{\sigma_w^2}{a_w^2}\right] \approx 1 - \dfrac{1}{2\bar{r}K}\left[\left(\dfrac{\tilde{V}}{\bar{V}}\right)^2 + \left(\dfrac{\tilde{W}}{\bar{W}}\right)^2\right],$

where $\bar{R}$ is an approximation to the mean value of the correlation
between two weighted composites,
$\bar{r}$ is the average intercorrelation between the variables being
combined,
$K$ is the number of variables being combined,
$\sigma_v$ and $\sigma_w$ are the standard deviations of the two populations of
weights being considered, and
$a_v$ and $a_w$ are the means of the two populations of weights being
considered.


In the absence of information on the mean and the standard deviation
of the population of weights being sampled, the values for the sample
($\tilde{V}$, $\tilde{W}$, $\bar{V}$, and $\bar{W}$) may be used instead. The variance of $R$ is given by
terms of the order

$(R - \bar{R})^2 = \dfrac{1}{4\bar{r}^2 K^2}\left[\dfrac{2e_2}{a_v a_w} - \dfrac{e_4}{a_v^2} - \dfrac{e_6}{a_w^2}\right]^2.$

Since each term $e_i$ has a constant variance independent of $K$, we see
that the variance of $R$ is of the order $(1/K)^2$ so that the individual $R$
terms will vary from $\bar{R}$ by terms of the order $1/K$.


Equation 47 may be used as an approximation of the correlation
between two weighted composites. It should be noted
that the equation does not apply if (a) the average intercorrelation
of the variables ($\bar{r}$) is near zero or is negative, or (b)
negative and positive weights are used so that $a$ is near zero
and small in relation to $\sigma$, or (c) the correlation between the
two sets of weights ($r_{VW}$) is negative. Under any one or more
of these conditions, a more general equation such as equation
15 must be used.


For example, equation 47 indicates that, if the quantity $2K\bar{r}$ is thirty
or larger, there is no point to bothering with different weighting systems, 
unless we are prepared to consider negatively correlated weights or 
weights some of which are positive and some negative. 
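The rule of thumb just stated can be read directly from equation 47. The following is a minimal sketch of that formula, in which the function name and argument names are hypothetical and the numerical inputs are assumed for illustration. The ratios `cv_v` and `cv_w` stand for the ratio of the standard deviation to the mean for each set of weights.

```python
def wilks_mean_correlation(r_bar, k, cv_v, cv_w):
    """Equation 47: approximate mean correlation between two weighted composites.

    r_bar      -- average intercorrelation of the variables (must be positive)
    k          -- number of variables combined
    cv_v, cv_w -- ratio of standard deviation to mean for each set of weights
    """
    if r_bar <= 0:
        raise ValueError("equation 47 does not apply unless r_bar is positive")
    return 1 - (cv_v ** 2 + cv_w ** 2) / (2 * r_bar * k)

# 2*K*r_bar = 30, with both ratios as large as 1.
long_battery = wilks_mean_correlation(r_bar=0.3, k=50, cv_v=1.0, cv_w=1.0)

# A short battery with low intercorrelation (2*K*r_bar = 2) and ratios of .5.
short_battery = wilks_mean_correlation(r_bar=0.2, k=5, cv_v=0.5, cv_w=0.5)

print(long_battery, short_battery)
```

When $2K\bar{r}$ is 30 and neither ratio exceeds unity, the bracketed quantity is at most 2, so the expected correlation between the two composites cannot fall below about .93. For the short battery assumed here the expected correlation is .75, which leaves room for the choice of weights to matter.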

In arriving at equation 47, Wilks assumed that there was no probability
dependence between the $V$'s and $W$'s. There might or might
not be such dependence between $r$ and $V$ or $r$ and $W$. He also assumed
that $\bar{r}$ was greater than zero, and that the number of $r_{gh}$ values greater
than zero was of the order of $K^2$. The use of generally positive weights
was also assumed, that is, $\bar{V}$ and $\bar{W}$ were assumed to be larger than $\tilde{V}$
and $\tilde{W}$, respectively.

2. Predicting an external criterion by multiple correlation 

If an external criterion is available, and we desire to weight the sub¬ 
scores in such a manner that the composite score will have the highest 
possible correlation with the criterion, the method of multiple correlation 
is the one to use. Again it must be remembered, as pointed out in the
preceding section, that the precise method of weighting is not important
unless we are dealing with relatively few tests that are not highly correlated
with each other.

We will present the proof, using calculus and the solution of linear 
equations by determinants. 

In the multiple correlation problem we have one criterion or dependent
variable, which is to be approximated as closely as possible by a
weighted sum of the independent variables. We may write

(48)  $\hat{x}_{i0} = b_1 x_{i1} + b_2 x_{i2} + \cdots + b_K x_{iK} = \sum_{g=1}^{K} b_g x_{ig},$

where $\hat{x}_{i0}$ is the predicted criterion score of the $i$th individual,
$b_g$ ($g = 1, \ldots, K$) is the weight assigned to the $g$th test, and
$x_{ig}$ ($i = 1, \ldots, N$; $g = 1, \ldots, K$) is the deviation score of the $i$th
individual on the $g$th test.

The multiple correlation problem is to choose the values of $b_g$ so that
the correlation of the criterion scores ($x_0$) with the predicted criterion
scores ($\hat{x}_0$) will be as large as possible. This is the same as making the
sum of the squares of the differences between $x_0$ and $\hat{x}_0$ as small as
possible. We may write

(49)  $E = \sum_{i=1}^{N} (x_{i0} - \hat{x}_{i0})^2,$

where $x_{i0}$ is the criterion score of the $i$th individual, and $E$ is the error
of prediction. The multiple correlation problem is to select the $b$'s
that will minimize the value of $E$.

If we substitute equation 48 in equation 49, and set each of the
derivatives ($\partial E / \partial b_g$) equal to zero, we have

(50)  $$\begin{aligned}
\Sigma x_0 x_1 - b_1 \Sigma x_1^2 - b_2 \Sigma x_2 x_1 - \cdots - b_K \Sigma x_K x_1 &= 0, \\
\Sigma x_0 x_2 - b_1 \Sigma x_1 x_2 - b_2 \Sigma x_2^2 - \cdots - b_K \Sigma x_K x_2 &= 0, \\
&\;\;\vdots \\
\Sigma x_0 x_K - b_1 \Sigma x_1 x_K - b_2 \Sigma x_2 x_K - \cdots - b_K \Sigma x_K^2 &= 0.
\end{aligned}$$

The equations 50 give the $b$'s in terms of the variances and covariances
of the independent variables, and the covariances of the dependent with
each of the independent variables.
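
The normal equations can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names (`solve_linear`, `deviations`, `regression_weights`) and the five-person data set are hypothetical, and the system is solved by Gaussian elimination rather than by determinants. The criterion is built as an exact weighted sum of the two tests so that the recovered weights can be compared with known values.

```python
from typing import List


def solve_linear(a: List[List[float]], y: List[float]) -> List[float]:
    """Solve a @ b = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (m[r][n] - tail) / m[r][r]
    return b


def deviations(scores: List[float]) -> List[float]:
    """Convert gross scores to deviation scores."""
    mean = sum(scores) / len(scores)
    return [v - mean for v in scores]


def regression_weights(criterion: List[float], tests: List[List[float]]) -> List[float]:
    """Weights b_g that satisfy the normal equations for the given scores."""
    x0 = deviations(criterion)
    xs = [deviations(t) for t in tests]
    # Sums of squares and cross products of the independent variables.
    a = [[sum(p * q for p, q in zip(xg, xh)) for xh in xs] for xg in xs]
    # Cross products of each independent variable with the criterion.
    y = [sum(p * q for p, q in zip(xg, x0)) for xg in xs]
    return solve_linear(a, y)


# Hypothetical scores for five individuals on two tests.
test1 = [1.0, 2.0, 3.0, 4.0, 5.0]
test2 = [2.0, 1.0, 4.0, 3.0, 5.0]
# Criterion constructed as 2*test1 + 1*test2, so the weights are known.
criterion = [2 * u + v for u, v in zip(test1, test2)]
weights = regression_weights(criterion, [test1, test2])
```

With real data the criterion is never an exact weighted sum, and the same call then returns the least-squares weights.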

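The solution of equations 50 can be expressed in determinantal form.
Let the determinant

(51)  $$\Delta = \begin{vmatrix}
1 & r_{10} & r_{20} & r_{30} & \cdots & r_{K0} \\
r_{01} & 1 & r_{21} & r_{31} & \cdots & r_{K1} \\
r_{02} & r_{12} & 1 & r_{32} & \cdots & r_{K2} \\
r_{03} & r_{13} & r_{23} & 1 & \cdots & r_{K3} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
r_{0K} & r_{1K} & r_{2K} & r_{3K} & \cdots & 1
\end{vmatrix}.$$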


Let $\Delta_{00}$ be the determinant formed by deleting the first row and the
first column of $\Delta$, $\Delta_{01}$ be the determinant formed by deleting the
first row and second column of $\Delta$, $\Delta_{02}$ be the determinant formed
by deleting the first row and third column of $\Delta$, and in general let
$\Delta_{0g}$ be the determinant formed by deleting the first row and the
$(g + 1)$-th column of $\Delta$. Then the solution of equations 50 is given by


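(52)  $$\begin{aligned}
b_{01.234 \cdots K} &= (-1)^{0} \frac{\Delta_{01} s_0}{\Delta_{00} s_1}, \\
b_{02.134 \cdots K} &= (-1)^{1} \frac{\Delta_{02} s_0}{\Delta_{00} s_2}, \\
b_{03.124 \cdots K} &= (-1)^{2} \frac{\Delta_{03} s_0}{\Delta_{00} s_3}, \\
&\;\;\vdots \\
b_{0K.1234 \cdots (K-1)} &= (-1)^{(K-1)} \frac{\Delta_{0K} s_0}{\Delta_{00} s_K}.
\end{aligned}$$

In general, we may write

(53)  $$b_{0g.12 \cdots (g-1)(g+1) \cdots K} = (-1)^{(g-1)} \frac{\Delta_{0g} s_0}{\Delta_{00} s_g}.$$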

When the multiple regression weights have been determined accurately
for a set of variables, it is well to remember the proof given in
the preceding section that two sets of highly correlated weights will give
highly correlated composites. This means that, instead of using the
awkward fractional weights indicated by the multiple correlation
solution, we may approximate them by a set of simple integral weights.

The weights indicated in equations 52 when used in equation 48 will
give the best estimate of $x_0$ in the sense that the error ($E$) indicated in
equation 49 will be a minimum. Dividing $E$ by $N$, the number of cases,
and taking the square root gives the error of estimate, which may be
written

(54)  $$s_{0.123 \cdots K} = s_0 \sqrt{\frac{\Delta}{\Delta_{00}}}.$$

The weighted sum $\hat{x}_0$ given by equation 48, using the weights of
equations 52, correlates higher with $x_0$ than any other possible weighted
sum of the independent variables $x_1 \cdots x_K$. This correlation is the
multiple correlation, and its value is given by

(55)  $$R_{0.123 \cdots K} = \sqrt{1 - \frac{\Delta}{\Delta_{00}}}.$$


For the simplest possible case of multiple correlation, that of predicting
one criterion ($x_0$) from two independent variables ($x_1$ and $x_2$), these
equations may be written very much more simply. Equations 52 for
the weights of $x_1$ and $x_2$ become

(56)  $$b_{01.2} = \frac{s_0 (r_{01} - r_{02} r_{12})}{s_1 (1 - r_{12}^2)}, \qquad
b_{02.1} = \frac{s_0 (r_{02} - r_{01} r_{12})}{s_2 (1 - r_{12}^2)}.$$

The error of estimate of equation 54 becomes

(57)  $$s_{0.12} = s_0 \sqrt{\frac{1 + 2 r_{01} r_{02} r_{12} - r_{01}^2 - r_{02}^2 - r_{12}^2}{1 - r_{12}^2}}.$$

The multiple correlation given in equation 55 becomes

(58)  $$R_{0.12} = \sqrt{\frac{r_{01}^2 + r_{02}^2 - 2 r_{01} r_{02} r_{12}}{1 - r_{12}^2}}.$$

The equations 56 to 58 give the weights, error of estimate, and multiple
correlation for the three-variable case in terms of the three
intercorrelations and three standard deviations of the original variables.
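
The three-variable case is small enough to compute directly. The sketch below is illustrative only: the function name `two_predictor_solution` and the input values are hypothetical, and the formulas coded are the standard two-predictor results, written out in the comments so that the block can be read on its own.

```python
import math
from typing import Tuple


def two_predictor_solution(
    r01: float, r02: float, r12: float, s0: float, s1: float, s2: float
) -> Tuple[float, float, float, float]:
    """Return (b01, b02, error of estimate, multiple R) for two predictors."""
    denom = 1.0 - r12 ** 2
    # Gross score weights: b = s0 (r01 - r02 r12) / [s1 (1 - r12^2)], and
    # similarly for the second predictor with subscripts 1 and 2 exchanged.
    b01 = s0 * (r01 - r02 * r12) / (s1 * denom)
    b02 = s0 * (r02 - r01 * r12) / (s2 * denom)
    # Squared multiple correlation: (r01^2 + r02^2 - 2 r01 r02 r12) / (1 - r12^2).
    r_squared = (r01 ** 2 + r02 ** 2 - 2.0 * r01 * r02 * r12) / denom
    multiple_r = math.sqrt(r_squared)
    # Error of estimate: s0 * sqrt(1 - R^2).
    error = s0 * math.sqrt(1.0 - r_squared)
    return b01, b02, error, multiple_r


# Hypothetical validities .60 and .50, intercorrelation .50, unit standard deviations.
b01, b02, error, multiple_r = two_predictor_solution(0.60, 0.50, 0.50, 1.0, 1.0, 1.0)
```

Because the second test overlaps with the first, its weight comes out well below its validity, which is the usual effect of a positive intercorrelation.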

It should be noted that it may readily happen in multiple correlation 
methods that a given test is assigned a negative weight. This means 
that the better a person does on that test, the poorer will be his composite 
score, and vice versa. Such weights should lead to a careful scrutiny 
of the test and a consideration of the reasonableness of such a finding. 
Many situations arise where a negative weight is plausible. However, 
it should be noted that in a test which is to be given repeatedly, and for 
which the scoring method may become known to the candidates, it 
would be very unwise to retain a test with negative weight, since it is 
very easy for the subjects who know the scoring method to attempt to 
obtain, and succeed in getting, a low score. Such a change in motivating 
conditions would destroy any predictive value that the test might have 
had previously when the subjects were all attempting to obtain a high 
score. Adkins et al. (1947), page 170, indicate that negative weights
are not used in civil service tests.

In dealing with more than three variables, it is necessary to use special
computational methods, such as those described in Guilford (1936b),
pages 390-404.

If a criterion is available, multiple correlation methods give
the best weights for predicting that criterion. Simple integral
approximations to these weights will usually give a composite
score that correlates almost as well with the criterion.

3. Selecting tests for a battery by approximations to
multiple correlation

In addition to specifying the best set of weights to use for each of
the tests in a battery, it is frequently desirable to eliminate some of
the tests as well. For example, we might use six subtests in an
experimental battery, and wish to know which three would give the highest
multiple correlation. It would also be desirable to know what the
multiple correlation was for all six tests, as well as for the best set of
three, in order to see how much was lost in predictive accuracy by
eliminating the poorest half of the tests.

The only certain method of obtaining an exact answer to such
questions is to work the zero-order correlations, the multiple correlations
for all possible combinations of two tests, for all possible combinations of
three, or four, or five, and the multiple correlation using all six tests.
This would mean, for six predictor tests and one criterion, computing
15 + 20 + 15 + 6 + 1, or 57 multiple correlations. We could then
easily pick the best combination of tests at each stage and decide
whether the additional testing time was adequately repaid in terms of
higher validity coefficients. With the ordinary computing methods in
use at present, the labor of such computations makes such analyses
prohibitive. It may well be that, with the development of high-speed
electronic computing machines, the exact solution of such a problem
would be more economical than many of the approximation methods now
in use.

Frisch (1934) described a method of dealing with what he termed 
“complete regression systems” by “confluence analysis.” This method 
essentially involved computing multiple correlations and multiple 
regression weights for all combinations of the variables involved in 
order to understand thoroughly the relationships among these variables. 

One very good approximation method is to look first at the zero-order
correlation coefficients, and select the one best test. This test is
then tried out with each of the $K - 1$ remaining tests to see which two
(including the one best) will give the highest multiple. These two best
are then combined in turn with each of the $K - 2$ remaining tests to
pick the “best” combination of three. With this method we should
select three tests out of a set of six by working multiples for only 5
two-test composites and 4 three-test composites, that is, 9 instead of 57
multiple correlations. Such a method has been described by Toops
(1923). Other closely similar procedures have been described by Wherry
(see Stead, Shartle, and associates, 1940, Appendix V), by Toops (1941),
by Wherry and Gaylord (1946), and by Horst (1934b).
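
The stepwise approximation just described can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of any of the cited procedures: the names `solve_linear`, `multiple_r_squared`, and `forward_select` are hypothetical, and the correlation table is invented (six validities, with every intercorrelation set to .30). Variable 0 is the criterion.

```python
from math import comb
from typing import List, Tuple


def solve_linear(a: List[List[float]], y: List[float]) -> List[float]:
    """Solve a @ b = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (m[r][n] - tail) / m[r][r]
    return b


def multiple_r_squared(corr: List[List[float]], subset: List[int]) -> float:
    """Squared multiple correlation of variable 0 with the tests in subset."""
    among_tests = [[corr[g][h] for h in subset] for g in subset]
    validities = [corr[0][g] for g in subset]
    betas = solve_linear(among_tests, validities)  # standard score weights
    return sum(b * v for b, v in zip(betas, validities))


def forward_select(corr: List[List[float]], n_keep: int) -> Tuple[List[int], float, int]:
    """Add one test at a time, always keeping the tests already chosen."""
    chosen: List[int] = []
    remaining = list(range(1, len(corr)))
    multiples_worked = 0
    best_r2 = 0.0
    while len(chosen) < n_keep:
        scored = [(multiple_r_squared(corr, chosen + [g]), g) for g in remaining]
        if chosen:  # the first pass only inspects zero-order correlations
            multiples_worked += len(scored)
        best_r2, best = max(scored)
        chosen.append(best)
        remaining.remove(best)
    return chosen, best_r2, multiples_worked


# Hypothetical battery: six validities, all intercorrelations equal to .30.
validities = [0.50, 0.40, 0.45, 0.30, 0.35, 0.20]
k = len(validities)
corr = [[1.0] + validities]
for g in range(k):
    corr.append([validities[g]] + [1.0 if h == g else 0.30 for h in range(k)])

chosen, r2, worked = forward_select(corr, 3)
# Number of multiples needed for the exhaustive search over two to six tests.
exhaustive = sum(comb(k, size) for size in range(2, k + 1))
```

The stepwise search works 9 multiples where the exhaustive search works 57. It is not guaranteed to find the best set of three, since it never reconsiders a test once chosen.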

If we are willing to assume that the best set of two tests includes the 
best one, the best set of three includes the two previously indicated, 
and so on, and in addition to assume that the relative weights determined 
for the best two also hold when these two are combined with a third, and
so on up, a very quick and easy graphic approximation method has been 
provided by Jenkins (1946). 

4. Weighting according to test reliability or inversely as the 
error variance 

Giving the more reliable tests greater weight in a composite has been
suggested by Kelley (1927), pages 211-213, Thurstone (1931a), pages
88-90, and Richardson [see Appendix D, pages 392-396, in Horst
(1941)]. Kelley and Richardson give the gross score weight of any test
($g$) as $r_{gg}/(1 - r_{gg})$, that is, the weight is the ratio of the
reliability coefficient to the error variance of the standard scores.
Thurstone follows a slightly different procedure and finds the gross score
weights to be $\sqrt{r_{gg}}/(1 - r_{gg})$. The former formula can readily be
derived from multiple correlation theory with two assumptions. The first is
that all the tests are measures of the same true score, except for the fact
that they contain different proportions of random error. This assumption
means that the intercorrelations will be unity, when corrected for
attenuation. The second assumption is that we wish to maximize the
correlation between this common true score and the weighted composite.
Stated in mathematical terms, these assumptions mean that

(59)  $$r_{gh} = \sqrt{r_{gg} r_{hh}} \qquad (g \neq h = 1 \cdots K),$$

the intercorrelation between any two tests is equal to the geometric
mean of the reliability coefficients. The criterion may be assumed to
be the true score, in which case the validity coefficient of each test is
given by

(60)  $$r_{tg} = \sqrt{r_{gg}},$$

or it may be assumed that the criterion is another test $x_0$, which also
has the same true score, in which case the validity coefficient of each
test is given by

(61)  $$r_{0g} = \sqrt{r_{00} r_{gg}}.$$

Identical relative weights are given by either assumption. In the
following derivation we shall use equations 59 and 61. We see that the
criterion reliability ($r_{00}$) is a factor common to all the weights, hence
may be ignored if we are interested only in relative weights.
Substituting equations 59 and 61 in equation 51, we have

(62)  $$\Delta = \begin{vmatrix}
1 & \sqrt{r_{00} r_{11}} & \sqrt{r_{00} r_{22}} & \cdots & \sqrt{r_{00} r_{KK}} \\
\sqrt{r_{00} r_{11}} & 1 & \sqrt{r_{11} r_{22}} & \cdots & \sqrt{r_{11} r_{KK}} \\
\sqrt{r_{00} r_{22}} & \sqrt{r_{11} r_{22}} & 1 & \cdots & \sqrt{r_{22} r_{KK}} \\
\vdots & \vdots & \vdots & & \vdots \\
\sqrt{r_{00} r_{KK}} & \sqrt{r_{11} r_{KK}} & \sqrt{r_{22} r_{KK}} & \cdots & 1
\end{vmatrix}.$$

Since the factor $s_0/\Delta_{00}$ is common to all the weights in equations 52,
we may ignore it and evaluate terms of the form $\Delta_{0g}/s_g$ to determine
the relative weights for the variables. We may form $\Delta_{01}$ by deleting
the first row and second column of equation 62 and then transform the
determinant $\Delta_{01}$ by multiplying the first column by
$\sqrt{r_{22}}/\sqrt{r_{00}}$ and deducting the product from the second column.
In general multiply the first column by $\sqrt{r_{gg}}/\sqrt{r_{00}}$ and deduct
the product from column $g$. These transformations do not alter the value of
the determinant, so we have

(63)  $$\Delta_{01} = \begin{vmatrix}
\sqrt{r_{00} r_{11}} & 0 & 0 & \cdots & 0 \\
\sqrt{r_{00} r_{22}} & 1 - r_{22} & 0 & \cdots & 0 \\
\sqrt{r_{00} r_{33}} & 0 & 1 - r_{33} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\sqrt{r_{00} r_{KK}} & 0 & 0 & \cdots & 1 - r_{KK}
\end{vmatrix}.$$


Expanding equation 63 in terms of minors of the first row, we have

(64)  $$\Delta_{01} = \sqrt{r_{00} r_{11}} \, (1 - r_{22})(1 - r_{33}) \cdots (1 - r_{KK}).$$

If we multiply and divide the right side of equation 64 by $(1 - r_{11})$
and let

$$P = (1 - r_{11})(1 - r_{22})(1 - r_{33}) \cdots (1 - r_{KK}),$$

we have

$$\Delta_{01} = \frac{P \sqrt{r_{00} r_{11}}}{1 - r_{11}}.$$


We may write all the other terms of the form $\Delta_{0g}$ similarly, omitting
the common factor $P$ in order to deal only with relative weights. If
the factors common to all the weights are designated by $C$, and the
standard score weight for test $g$ is indicated by $\beta_{0g.12 \cdots K}$, we
have equations of the form

(65)  $$C \beta_{0g.12 \cdots K} = \frac{\sqrt{r_{gg}}}{1 - r_{gg}}.$$

From equations 53 of this chapter and 20 of Chapter 3, we obtain
the weights appropriate for use with gross scores. These weights,
designated by $b_{0g.12 \cdots K}$, are

(66)  $$C' b_{0g.12 \cdots K} = \frac{r_{gg}}{1 - r_{gg}} \qquad (g = 1 \cdots K),$$

where $C'$ designates the factors common to all the weights. Weighting
formulas 65 and 66 were presented by Kelley (1927), pages 211-213.
The detailed derivation has been given by Richardson in Appendix D,
see Horst (1941, pages 392-396). Thurstone (1931a) has suggested the
use of weights dependent on reliability, which differ from these by a
factor of $\sqrt{r_{gg}}$.

The use of the weights of equation 66 depends upon the assumptions 
indicated in equations 59 and 61. Whenever several tests and their 
reliabilities are available so that the weights of equation 66 may be 
used, it is also always possible to calculate the intercorrelations among 
these tests in order to verify the assumption in equation 59. It is 
probably only in exceptional cases that this assumption would be 
verified. Constructing a set of tests that satisfy a single factor solution 
is a fairly difficult job. Equation 59 is a far more stringent requirement 
than a single factor battery. Tests may have one common factor, and 
differ both in error and in factors specific to each test. Equation 59 
does not allow for any possibility of a factor specific to each test. 

In summary, we may say that, while it is usually desirable
to give greater weight to the more reliable test, there is usually
no special justification for the particular weights indicated by
equation 66.
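
A short numerical sketch may help show both the weights and the assumption on which they rest. The function names and the three reliabilities below are hypothetical. The first function gives the Kelley-Richardson relative gross score weight, $r_{gg}/(1 - r_{gg})$; the second gives the intercorrelation that two tests would have to show if they differed only in random error (the geometric mean of their reliabilities), so that it can be compared with an observed intercorrelation.

```python
import math
from typing import List


def reliability_weights(reliabilities: List[float]) -> List[float]:
    """Relative gross score weights r_gg / (1 - r_gg)."""
    return [r / (1.0 - r) for r in reliabilities]


def implied_intercorrelation(r_gg: float, r_hh: float) -> float:
    """Intercorrelation required if two tests share one true score."""
    return math.sqrt(r_gg * r_hh)


def assumption_gap(observed_r: float, r_gg: float, r_hh: float) -> float:
    """How far an observed intercorrelation falls below the required value."""
    return implied_intercorrelation(r_gg, r_hh) - observed_r


# Hypothetical reliabilities for three tests.
reliabilities = [0.90, 0.80, 0.50]
weights = reliability_weights(reliabilities)

# Tests with reliabilities .90 and .80 would need to intercorrelate about .85.
needed = implied_intercorrelation(0.90, 0.80)
# A hypothetical observed intercorrelation of .60 falls well short of that.
gap = assumption_gap(0.60, 0.90, 0.80)
```

The weights rise steeply with reliability (roughly 9, 4, and 1 here), which is one reason the size of the gap is worth examining before such weights are adopted.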

5. Weighting inversely as the standard deviation 

Weighting gross scores in tests by the reciprocal of the standard
deviation has also been frequently given as one method of combining
tests. See Kelley (1927), page 66, Thurstone (1931a), pages 83-87, and
others. It should be noted that such a weighting principle is justified
only in highly specialized and unusual cases. For example, if the true
variance of the group tested is large, this will contribute to making the
standard deviation of the test large, and would seem to be no valid
reason for decreasing the weight of the test. On the other hand, test
score variance may be increased by increasing the error variance of a
test. If such is the case, a decreased weight for the test is plausible;
however, there is no reason for making this weight inversely proportional
to the total standard deviation, which is the square root of the sum of
true and error variance. The third factor that may influence the standard
deviation of a test is its length. A 100-item test will have a much larger
standard deviation than a 10-item test. Clearly on a common-sense
basis, we should not increase the accuracy of a test by lengthening it
from 10 to 100 items, and then reduce the weight of the better test by
weighting it inversely as its standard deviation. A detailed criticism
of the method of weighting inversely as the standard deviations is given
by Richardson. See Horst (1941), Appendix D, pages 385-388.

It is interesting to note that under certain highly specialized conditions
the multiple correlation weights of equation 53 become proportional to the
reciprocal of the standard deviation of the test. If it is assumed that
all the test intercorrelations are equal to some value, $r$, for instance,
and that all the validity coefficients ($r_{0g}$) are equal to some value, for
example, $v$, then $\Delta_{01} = \Delta_{02} = \cdots = \Delta_{0K}$, so that all
the weights indicated in equations 52 are identical except for the standard
deviation appearing in the denominator. In other words, if all the
independent variables in a set are identical with respect to validity, and
have identical intercorrelations, so that no special test clusters are
formed, and if in spite of such remarkable similarity the tests still differ
in variability, the multiple correlation weights are inversely proportional
to the standard deviations of the tests.

It should also be noted that various methods of scoring a test may 
also have an effect on its standard deviation. For example, if two 100- 
item tests are scored differently, test a receiving one point per item and 
test b ten points per item, then if the tests are reasonably similar the 
standard deviation of b and its influence in any composite would be 
very much greater than that of a. Similarly, tests scored number right 
will have a different standard deviation if the scoring system is changed 
to R — cW. 

The standard deviation of a test is an important factor in determining 
the influence of that test on any composite. However, it is not possible 
to set up any sensible routine method for using the standard deviation 
in determining the weight of a test. If one test has a larger standard 
deviation than another test, and this difference seems to be due to factors 
that are largely irrelevant to the reliability and validity of the test, 
weighting inversely as the standard deviation is probably reasonable. 
If the test with the larger standard deviation is more valid or reliable, 
or if it seems to be reasonable to assume that it would be more valid 
and reliable because it is a longer or a better test, then simply adding 
in the gross scores of the two tests would be a reasonable procedure, 
and weighting inversely as the standard deviations would only help 
to decrease the influence of the best test. On the other hand, if it seems
that the test with the larger standard deviation owes this extra
variability to error variance, some still different weighting scheme that
would decrease the weight of or eliminate the poorer test would seem
reasonable.

Weighting inversely as the standard deviation is to be avoided
as a routine procedure. Other factors being approximately
equal, a composite will be influenced more by a test with a
large standard deviation than by one with a small standard
deviation. This higher weight is probably desirable if the
greater standard deviation is due to such factors as greater
test length or reliability that contribute to true variance. The
higher weight is probably undesirable if it seems that the
greater standard deviation is due to irrelevant multiplying
factors in the scoring key or to any factors that would increase
the error variance.

6. Weighting inversely as the error of measurement 

Weighting each test by the factor $1/(s_g \sqrt{1 - r_{gg}})$ is a method that
would be free of the obvious objections that apply to weighting either
inversely as the standard deviations or inversely as the square of the
error of measurement. For example, such a weighting would
automatically correct for any arbitrary change in scoring that affected the
standard deviation of the test without altering its reliability. As the
true variance increased, or the error variance decreased, there would be
an appropriate direction of adjustment of the weights. If the test length
were altered so as to raise the reliability, the weight would be increased.
As an arbitrary rule of thumb method for use when no criterion is
available and the tests seem indifferent as far as judgment of content is
concerned, it would seem that such a system would be appropriate.
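
The self-correcting property claimed for this rule can be seen in a small sketch. The function name and the numerical values are hypothetical. The weight for a test is the reciprocal of its error of measurement, that is, of its standard deviation multiplied by the square root of one minus its reliability.

```python
import math


def error_of_measurement_weight(sd: float, reliability: float) -> float:
    """Weight equal to the reciprocal of the error of measurement."""
    return 1.0 / (sd * math.sqrt(1.0 - reliability))


# The same hypothetical test scored one point per item and ten points per item:
# the standard deviation is multiplied by ten, the reliability is unchanged.
w_one_point = error_of_measurement_weight(10.0, 0.75)
w_ten_points = error_of_measurement_weight(100.0, 0.75)

# The weighted standard deviation, which governs the test's influence on a
# composite, is the same under either scoring key.
influence_one_point = w_one_point * 10.0
influence_ten_points = w_ten_points * 100.0

# A more reliable test with the same standard deviation receives more weight.
w_more_reliable = error_of_measurement_weight(10.0, 0.91)
```

The arbitrary factor of ten in the scoring key is exactly offset by the weight, while an increase in reliability at a fixed standard deviation raises the weight.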

Weighting inversely as the error of measurement
automatically corrects for any arbitrary multiplying factors introduced
in the scoring system, increases the weight of a test as true
variance or reliability is increased, and decreases the weight
of a test as the error variance is increased. Although no
rationale has been suggested for this method, it has excellent
properties from a common-sense point of view, and is
probably the safest arbitrary rule of thumb method to recommend
for general use, when no criterion is available and when test
intercorrelations are not computed.

7. Irrelevance of test mean, number of items, or perfect score

In most amateur discussions of weighting of tests the first factors
considered are the number of items in the test and the average magnitude
of the score. It is believed, for example, that if gross scores are added,
the effect will be to give a 100-item test twice the weight of a 50-item
test. That such is not the case can be seen, for example, by assuming that
the 100-item test was a very easy one on which everyone obtained scores
ranging from 95 to 100. Adding scores on this test to a student’s record
would then, at the most, make a 5-point difference in the total score.
If, on the other hand, the 50-item test were composed of fairly difficult
items and were fairly reliable, it could easily be that scores on it would
range from 20 to 50. In other words, adding this test would make a
30-point difference in extreme cases, and a 10- or 20-point difference in
the majority of cases, so that the total score would agree rather closely
with the score on the 50-item test and not correlate with the score on
the 100-item test.
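
The illustration can be tried with a handful of made-up scores. The six pairs of scores below are hypothetical, chosen only so that the easy 100-item test ranges from 95 to 100 and the harder 50-item test from 20 to 50; the helper name `pearson` is likewise an assumption of this sketch.

```python
import math
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Product-moment correlation between two lists of scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / math.sqrt(ss_a * ss_b)


# Hypothetical gross scores for six students.
easy_100_item = [96, 97, 100, 95, 99, 98]   # range 95 to 100
hard_50_item = [20, 50, 35, 42, 28, 45]     # range 20 to 50

# Unweighted sum of gross scores.
total = [e + h for e, h in zip(easy_100_item, hard_50_item)]

r_easy = pearson(easy_100_item, total)
r_hard = pearson(hard_50_item, total)
```

In this made-up example the total correlates above .98 with the 50-item test and below .15 with the 100-item test, although the latter has twice as many items and a mean near 97.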

From the illustration just given, we also see that the weight a test 
exerts is not related to the magnitude of the average score either. The 
100-item test in the illustration would have a mean of 97 or 98 correct 
answers, and the 50-item test would have a mean in the 30’s. The initial 
amateur reaction on seeing two sets of test scores, one set mostly in the 
30’s and the other all in the high 90’s, would be to feel that, if the two 
sets were added, the first would have approximately one-third the weight 
of the second. As we have just seen, the total range of scores for a test
is the important factor in determining its effect on a composite.

It can be seen that, of themselves, the test mean and number of items
have no effect whatever on the relationship between a test and the
composite of which it is a part. Both factors should be completely ignored
in considering weighting problems.

It might be noted that, if all students do not have the same series of
tests, the mean score is an important factor. Suppose, for example,
that the students have the choice of answering questions X or Y, or of
submitting answers on the X-test or, alternatively, on the Y-test. If
the X-scores range from 30 to 50 with an average around 40, and the
Y-scores range from 70 to 90 with an average around 80, clearly the
students who have chosen Y and not X will get in general 40 more points
in their total than those who have chosen X and not Y. In such a case
it is possible to “adjust” by adding 40 points to each person’s X-score
or subtracting 40 points from each person’s Y-score. However, another
complication arises here. Can it correctly be assumed that the students
who submitted X are on the average identical with those who submitted
Y? Frequently when alternative choices are given it happens that
better students tend to pick one and poorer students the other, so that
equating the average scores is not an appropriate procedure. Neither
is it correct simply to add the gross scores, which means assuming that
these correctly represent the difference between the two groups. In
general, it is impossible to determine the appropriate adjustment
without an inordinate amount of effort. Alternative questions should
always be avoided. The only possible rational solution is in the type
of methods suggested in the chapter on standardizing tests. A common
section must be used as the basis for equating the alternate parts. In
order to use these equating procedures, both the common parts and the
alternate parts need to be of a reasonable length to secure reliability,
so that the equating will be reasonably stable for similar groups. In
the conventional examination, where the student is asked to answer
any six of nine questions, we are really setting eighty-four different
examinations. If the examination were given to 8,400 students, there
would be only 100 per examination on the average, which would mean that
many of the combinations would be taken by very few students. The data
would be inadequate for equating, and the labor would be great.
Alternative questions should in general not be used. If some choice
seems unavoidable, the choice should be set up systematically by
requiring a given set of items to be answered by everyone, and then reducing
the number of possible combinations that can be submitted by requiring
that the student answer one question from this set, or that he answer one
of the following three sets of questions.

The number of items in a test (the perfect score) and the test
mean have no effect in determining the test's influence in a
composite and should be ignored when considering the
appropriate weight for a test. The only exception arises when
alternative questions (or tests) are used, in which case we
must allow not only for the test mean but also for ability
differences in the groups making the different choices.

8. Effect of a subtest on a composite score 

Having considered the effects of alternate sets of weights used on the 
same set of subtests, we may turn to the problem of the effect of a subtest 
on a composite score. 

Most of the discussions of this topic attempt to regard the composite 
as broken up into parts, and then assess the percentage contribution of 
each of the parts to the total. In general, such a solution is impossible, 
and will not be given here. 

A simple, direct, and meaningful way to think of the contribution of a
part to the total is to use the correlation between the part and the total
as an index of the contribution of the part. Wilks (1938) has suggested
this method, pointing out that, if each part has the same correlation
with the total, in one sense each part has the same weight in determining
total score. It should be noted that using the correlation coefficient as
an index enables us to define the “same” weight of two tests and to
define “greater” weight and “less” weight. However, it is not possible
to say that one test has two or three times the weight of another test.

Using the part-total correlation as an index of relative weights, we
are able to speak in terms of equal, greater, or less weight. It does not
enable us to divide the total into a given number of parts, one for each
subtest, totaling to 100 per cent; nor does it enable us to speak of
double or triple weights. However, of the various methods that have
been proposed of assessing the relationship or the “contribution” of the
part to the total, it is the most generally useful and intelligible.

The correlation between any part $X_g$ and the total $X_C$, which is a
weighted sum of the parts, may be expressed as follows. Let

(67)  $$X_{iC} = W_1 X_{i1} + W_2 X_{i2} + \cdots + W_K X_{iK} = \sum_{g=1}^{K} W_g X_{ig}.$$

The composite score ($X_{iC}$) for individual $i$ is the weighted sum of his
scores on the individual tests. If the $X$'s are regarded as deviation
scores, we may write the correlation of part $X_g$ with $X_C$ as follows:

(68)  $$r_{gC} = \frac{\sum_{i=1}^{N} x_{ig} x_{iC}}{N s_g s_C}.$$

Expanding and summing the terms in $x_{iC}$, we have

(69)  $$r_{gC} = \frac{\sum_{i=1}^{N} \sum_{h=1}^{K} W_h x_{ig} x_{ih}}{N s_g s_C}.$$

Reversing the order of summation and writing $\sum_{i=1}^{N} x_{ig} x_{ih}$ as a
covariance or a variance term gives

(70)  $$r_{gC} = \frac{\sum_{h=1}^{K} W_h r_{gh} s_g s_h}{s_g s_C} \qquad (r_{gg} = 1).$$

If we separate the one variance term from the $K - 1$ covariance terms,
and divide numerator and denominator by $s_g$, we have

(71)  $$r_{gC} = \frac{W_g s_g + \sum_{h=1}^{K} W_h r_{gh} s_h}{s_C} \qquad (h \neq g),$$

where $r_{gC}$ is the correlation between test $g$ and the weighted
composite,

$W_g$ (or $W_h$) is the weight assigned to any test,

$s_g$ (or $s_h$) is the standard deviation of the test,

$r_{gh}$ is the intercorrelation between two tests, and

$s_C$ is the standard deviation of the composite.

Since sc is identical for each of the tests, it may be ignored. We 
see then that the correlation between the composite and any test is 
determined by the weight assigned that test, the standard deviation of 
that test, and the weighted sum of the correlations between that test 
and each of the other tests. 

In particular it should be noted that the test mean $M_X$ and the
number of items $K$ (that is, the perfect score) have nothing whatever
to do with determining the correlation of any test with the total
composite. This correlation is determined by the test standard deviations,
the intercorrelations, and the weights assigned. If the weighting factor 
for a test is increased, the correlation of that test with the composite 
will be increased. If the average correlation of a test with the other 
tests is increased, its correlation with the composite will be increased. 
If the standard deviation of a test is increased, its correlation with the 
composite will be increased. 

If all the tests in a set have zero intercorrelations, the correlation of
any test with the composite will be proportional to the product of its
assigned weight and its standard deviation. However, in the usual case
this term will be small in proportion to the weighted sum of the
correlations of that test with all the other tests.
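
The part-total correlation can be computed from weights, standard deviations, and intercorrelations alone, as the following sketch suggests. The function name `part_total_correlations` and the two small examples are hypothetical. The composite standard deviation is obtained from the same quantities, as the square root of the sum of $W_g W_h r_{gh} s_g s_h$ over all pairs of tests; each correlation is then the test's own term $W_g s_g$ plus the weighted sum of $r_{gh} s_h$ over the other tests, divided by that standard deviation.

```python
import math
from typing import List


def part_total_correlations(
    weights: List[float], sds: List[float], corr: List[List[float]]
) -> List[float]:
    """Correlation of each test with the weighted composite."""
    k = len(weights)
    # Variance of the composite from weights, standard deviations, and r's.
    var_c = sum(
        weights[g] * weights[h] * corr[g][h] * sds[g] * sds[h]
        for g in range(k)
        for h in range(k)
    )
    s_c = math.sqrt(var_c)
    result = []
    for g in range(k):
        own_term = weights[g] * sds[g]
        other_terms = sum(
            weights[h] * corr[g][h] * sds[h] for h in range(k) if h != g
        )
        result.append((own_term + other_terms) / s_c)
    return result


# Two uncorrelated tests, equal nominal weights, standard deviations 3 and 4.
uncorrelated = part_total_correlations(
    [1.0, 1.0], [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]]
)

# Two tests with equal standard deviations that intercorrelate .50.
correlated = part_total_correlations(
    [1.0, 1.0], [1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]]
)
```

In the uncorrelated case the correlations are in the ratio of the products of weight and standard deviation (.6 and .8). Neither the means nor the numbers of items enter the calculation at any point.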

To show that the factors of test variance and intercorrelations are
crucial and must be considered, we may cite an illustration reported
in Stuit (1947), pages 305-306. In one basic engineering school the
students' time was divided as follows: about four-sevenths in shop work,
one-seventh classroom work in mathematics, and two-sevenths classroom
work in mechanical drawing and shop theory. The weights ($W$)
were assigned to the part grades in proportion to the time spent on the
subject, and the weighted sum taken as the total grade. For one class
of 350 the correlation of total score with mathematics was .86, with
mechanical drawing .74, and with shop work .48. The final grades were
determined primarily by the one-seventh classroom work in mathematics,
and only to a slight extent by the four-sevenths spent in shop work.
The explanation is found in the standard deviations of the three sets
of part grades. The standard deviation of the mathematics grades was
7.7, of mechanical drawing, 4.1, and of shop work, 2.5. It is also
interesting to note the means of the three sets of part grades. These were,
for mathematics, 83.6, mechanical drawing, 89.1, and shop, 84.0. The
mathematics part, with the lowest assigned weight and lowest mean score,
had the highest actual weight in determining total score because of its
very high standard deviation. In shop, as the standard deviation shows,
the vast majority of men received grades in the 80 to 88 range so that
these grades had little influence on the total, even though they were
weighted nominally four times as much as the mathematics score. Here
the problem was to secure a greater spread of grades in shop work.
Since the grades of different instructors for the students' shop work
correlated from -.11 to .55, it was clearly desirable to secure more
uniform grading methods rather than simply to multiply such apparently
inaccurate ratings by a factor such as ten or twenty in order to have
them exert a predominant influence on final grades. Various gages were
devised to measure the products of shop work quickly and accurately.
Such increased accuracy in grading the shop work increased the
variability of shop grades, and thus these grades legitimately contributed
more to the total score of the students than the classroom work in
mathematics.

It is hoped that this illustration will demonstrate that both the
correlation and the standard deviation, as well as the nominal weight, must
be taken into consideration when making combinations of scores. It 
also shows that it may not be possible to reach an adequate solution of 
the problem simply by altering the weights of the different part scores. 
It may be necessary to devise new and better tests for certain aspects 
of the work before it is possible to give these aspects their desired weight 
in the total score. 

Equation 71 shows that the correlation between a composite
(C) and any one of its parts (g) is completely determined by
the weights, standard deviations, and intercorrelations of the
subtests. The test mean and the number of items (or perfect
score) have no effect on the correlation between part and total,
unless they influence the standard deviations and intercorrelations.
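
The dependence of the part-total correlation on weights, standard deviations, and intercorrelations can be illustrated with a short computation. The sketch below assumes the usual covariance algebra for a weighted sum; the function name and the two-part battery are hypothetical and are not the grades discussed in the illustration above.

```python
import numpy as np

def part_total_correlations(weights, sds, R):
    """Correlation of each part score with the weighted composite.

    weights : nominal weights W_h
    sds     : standard deviations s_h of the part scores
    R       : intercorrelation matrix of the parts (unit diagonal)
    """
    W = np.asarray(weights, dtype=float)
    s = np.asarray(sds, dtype=float)
    R = np.asarray(R, dtype=float)
    u = W * s                     # effective weights W_h * s_h
    s_C = np.sqrt(u @ R @ u)      # standard deviation of the composite
    return (R @ u) / s_C          # r_gC for every part g

# Hypothetical battery: part 0 has nominal weight 1 but a large spread,
# part 1 has nominal weight 4 but a small spread.
R = np.array([[1.0, 0.3],
              [0.3, 1.0]])
r_gC = part_total_correlations([1, 4], [20.0, 2.0], R)
print(np.round(r_gC, 3))
```

In this made-up case the part with the smaller nominal weight correlates more highly with the composite, because its standard deviation is ten times as large. The means of the parts never enter the computation.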

9. Use of judgment in weighting tests if no criterion is available 

If several part scores are to be combined to determine a total score 
and no criterion is available, so that multiple correlation methods cannot 
be used, one method is to use judgment regarding the relative magnitude 
of the correlation desired between the total and the different part scores. 
Before such judgments can be meaningfully made, it is necessary to
have considerable information on the interrelationships of the part
scores. First we note that the total number of items in each test and
the mean of each test are irrelevant, provided all the students have taken
each test. The necessary information includes the standard deviation,
reliability, and error of measurement for each test and the intercorrela¬ 
tions between the tests so that we may see the kinds of relationships 
already existing in the part scores. With this information at hand, we 
decide on a judgmental basis which of the parts should be weighted 
equal, which higher, and which lower. Wilks (1938) has proposed that 
one definition of “equal” weights be equal correlations with the composite
score. Higher weight means a higher correlation between
composite and the part; lower weight, a lower correlation. In terms of 
such a definition, we should then decide the relative magnitude of the 
correlations on a judgmental basis. In general, if a test is long and 
reliable, it probably should have a higher correlation with the final 
composite than a test that is short and unreliable. Furthermore, it 
must be noted that, if certain subtests intercorrelate highly with each 
other, these tests will necessarily have similar correlations with the com¬ 
posite. It is possible to decide that they will all correlate high with the 
composite or that they will all correlate low with the composite; but it 
is not possible to decide that one member of the set will correlate high 
and the others low with the composite. Roughly speaking, it is necessary 
that a set of highly intercorrelated subtests all “weight” high, or all 
“weight” low in the sense of correlating high or low with the composite 
score. When these decisions are made, the weights are found by solving 
a set of equations of the form 

(72)   $r_{gC}\, s_C = \sum_{h=1}^{K} r_{gh} W_h s_h \qquad (r_{gg} = 1),$

in which the $r_{gC}$ are the desired correlations determined by judgment, and
the $W$’s are the unknown weights.

If in a problem of weighting tests in a battery no criterion is
available, but the test intercorrelations and standard deviations
are available, it is possible to define relative weights in
terms of relative magnitude of the correlation of the part with
the composite. Then we judge what the relative magnitudes
of the various part-total correlations should be, enter these
values in equation 72, and solve for the relative weights. It is
also desirable to check the composite to see that the different
subtests have approximately the desired correlation with the
composite.
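
The procedure of entering judged part-total correlations in equation 72 and solving for the weights can be sketched numerically. The sketch assumes that only the relative sizes of the desired correlations are specified, so the unknown composite standard deviation is absorbed into the scale of the weights; the function name, the intercorrelations, and the standard deviations are hypothetical.

```python
import numpy as np

def weights_for_desired_correlations(desired, sds, R):
    """Solve equation 72 for relative weights.

    desired : judged relative magnitudes of the part-total correlations
    sds     : standard deviations of the part scores
    R       : intercorrelation matrix of the parts (unit diagonal)
    Returns the relative weights and the part-total correlations they produce.
    """
    desired = np.asarray(desired, dtype=float)
    s = np.asarray(sds, dtype=float)
    R = np.asarray(R, dtype=float)
    u = np.linalg.solve(R, desired)            # u_h = W_h * s_h, up to scale
    W = u / s                                  # relative weights
    achieved = (R @ u) / np.sqrt(u @ R @ u)    # check on the composite
    return W, achieved

R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
sds = [10.0, 5.0, 2.0]

# "Equal" weights in the sense of equal correlations with the composite.
W_equal, r_equal = weights_for_desired_correlations([1, 1, 1], sds, R)

# First part judged to deserve twice the correlation of the other two.
W_double, r_double = weights_for_desired_correlations([2, 1, 1], sds, R)
print(np.round(W_equal, 4), np.round(r_equal, 3))
print(np.round(W_double, 4), np.round(r_double, 3))
```

The achieved correlations are proportional to the judged values, not equal to them, since their absolute level is fixed by the intercorrelations. A solution of this kind can also give negative weights, which is one reason the final check on the composite is worth making.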

10. Use of factor analysis methods in determining weights of 
subtests 

As a general method for combining subtests we can see that, if the 
battery of tests is factored and a score set up for each factor, we have 
represented a maximum of material in the tests with the minimum 
number of scores. Such methods are dealt with in textbooks on factor 
analysis, and they are beyond the scope of this book. Usually such 
methods are too laborious and time consuming to be adopted for ordinary 
weighting problems. 

It should also be noted that this recommendation results in several 
different scores for each person. If a single grade is to be given or a 
single pass-fail line is to be drawn, the problem of how to combine these 
several factor scores into such a single total score still remains. It is of 
course possible to use judgment in assigning relative weights to the 
different factor scores. However, if judgment is to be used at this 
stage in determining the nature of the composite, it may be almost as 
good to use judgment directly on the subtest scores and avoid the work 
of the factor analysis. In a complex set of tests, however, it is likely 
that the groupings revealed by the factor analysis will make our judg¬ 
ments simpler and more meaningful. 

In determining a single score for a set of tests, it has been suggested 
that a particular factor score be used as the single score to represent the 
battery. There are several possibilities for selecting this factor score. 

The first principal axis has been suggested by Wilks (1938), Horst 
(1936a), and by Edgerton and Kolbe (1936). Computationally this 
method is quite laborious. It requires a successive approximations pro¬ 
cedure with even as few as four or five variables. It must also be noted 
that this method is directly sensitive to arbitrary changes in score 
variance. For example, if scores on one test are multiplied by ten, the 
principal axis will swing in the direction of that test. Furthermore, as 
previously indicated, it is of no help to adopt a device such as standard 
scores. Such a procedure would give great weight to short unreliable 
tests in the composite and relatively little weight to a test that was very 
long and accurate. However, provided we are able to fix arbitrarily 
on the appropriate units for each of the subtests, this method has some 
interesting properties. 

Horst derived this method by determining the set of weights that 
would maximize the variance of the composite scores (given a fixed 
value for the sum of squares of the weights). Edgerton and Kolbe 
derived the same method by determining the set of weights that would 
minimize the variance of the set of scores assigned to a given individual. 
Wilks derived the same method in seeking to minimize the “generalized
variance” for all individuals receiving the same score. The fact that
these three different approaches resulted in the same method is interesting.
It shows that, provided we fix the units for each test, the largest
principal axis score maximizes the variance of the composite score, 
minimizes the variance of scores for a given individual, and minimizes 
the generalized variance for all persons receiving the same score. 
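
The variance-maximizing property of the first principal axis, and its sensitivity to the units of the scores, can both be seen in a small computation. The sketch below assumes the covariance matrix of the tests is known and uses a modern eigenvalue routine in place of the successive approximations procedure; the function name and the correlations are hypothetical.

```python
import numpy as np

def first_principal_axis(cov):
    """Unit-length weights that maximize composite variance, with that variance."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    w = eigvecs[:, -1]                       # axis belonging to the largest root
    if w.sum() < 0:                          # the sign of an eigenvector is arbitrary
        w = -w
    return w, eigvals[-1]

R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

# Case 1: all three tests in the same units (standard deviation 1).
sds = np.array([1.0, 1.0, 1.0])
cov = R * np.outer(sds, sds)
w, var_max = first_principal_axis(cov)

# Case 2: scores on the first test multiplied by ten.
sds_scaled = sds * np.array([10.0, 1.0, 1.0])
cov_scaled = R * np.outer(sds_scaled, sds_scaled)
w_scaled, _ = first_principal_axis(cov_scaled)

print(np.round(w, 3), np.round(w_scaled, 3))
```

With equal units the three weights are of similar size; after the first test is rescaled, the axis points almost entirely along that test, as the text describes.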

Use of the first centroid axis was suggested by Horst (1936a) as an 
approximation procedure. He derived the principal axis solution as 
indicated above but, recognizing its laboriousness, suggested that the 
first centroid axis be used instead as an approximation to the principal
axis. Using the first centroid axis means making the weights for test g
proportional to $\sum_{h=1}^{K} r_{gh}$. This term is easily obtained. There is, however,
the usual problem in factor analysis regarding the selection of an appropriate
value for the term $r_{gg}$. In some solutions this term is given the
value unity; in other solutions it is set equal to zero, equal to the reli¬ 
ability coefficient, or equal to some estimate of the communality of the 
test. From the initial discussion of the effects of weighting, we see that,
if the correlations are few and small, the decision on the appropriate value
for $r_{gg}$ will make a great difference in the composite score. If the correlations
are many and large, the resulting composite score will be only
negligibly affected by this decision.

Horst, in recommending the use of the first centroid axis, felt that it 
was an approximation to the longest principal axis. Edgerton and Kolbe, 
however, gave an illustration in which the two were quite different. 
The amount of difference between these two solutions will depend on 
the nature of interrelationships in the battery. 

Using a single common factor as a guide to the weighting of tests in a 
battery was suggested by Spearman (1927, Appendix, page xix). This 
method would apply only to a battery of tests that satisfied Spearman’s 
original two-factor theory, which means that one factor is common to 
all the tests in the battery, and in addition each test has its own specific 
factor which is uncorrelated with the general factor and with the specific 
factor in each of the other tests. The correlation between the factor 
specific to x and that specific to y may be regarded as the correlation 
between x and y with the general factor partialled out. Since the 
specifics correlate zero, we have from the formula for partial correlation
that the correlation between any two tests is the product of the correlation of
each of those tests with the general factor. By methods formally
identical with those of section 4 on weighting by reliabilities, we can 
show that the multiple correlation weights to use in predicting the 
general factor from the tests in a one-factor battery are proportional to 


for any test (x), where $r_{gx}$ is the factor loading of test x or its correlation
with the one common factor. The methods of determining whether or
not a given set of tests is adequately represented by one common factor
and the methods of determining the correlation with this factor are
discussed in the various textbooks on factor analysis; see Spearman
(1927), Thurstone (1935a), (1947b), and others.
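
The multiple correlation weights for predicting a single common factor can be verified numerically. The sketch below assumes standardized scores and a battery that fits the one-factor model exactly, so that each intercorrelation is the product of two loadings; the loadings are hypothetical. Under these assumptions the regression weights come out proportional to $r_{gx}/(1 - r_{gx}^2)$, the familiar closed form for a one-factor battery. For raw scores a further adjustment for each test's standard deviation would be needed, which is not shown here.

```python
import numpy as np

def general_factor_weights(loadings):
    """Regression weights for predicting the common factor from the tests."""
    a = np.asarray(loadings, dtype=float)
    R = np.outer(a, a)               # r_xy = r_gx * r_gy under the one-factor model
    np.fill_diagonal(R, 1.0)         # unit diagonal for standardized scores
    beta = np.linalg.solve(R, a)     # normal equations: R * beta = r_g
    return beta, R

loadings = np.array([0.9, 0.7, 0.5, 0.3])      # hypothetical factor loadings
beta, R = general_factor_weights(loadings)
closed_form = loadings / (1.0 - loadings**2)

# Compare the two sets of weights after scaling each to sum to one.
print(np.round(beta / beta.sum(), 4))
print(np.round(closed_form / closed_form.sum(), 4))
```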

These various solutions based essentially on factor theory are interest¬ 
ing. The principal axis solution is especially interesting since it both 
maximizes interindividual differences and minimizes intraindividual 
differences. It is, however, markedly influenced by arbitrary decisions 
made on test scoring. Similar remarks apply to the centroid solution. 
Where only one factor is necessary to account for the correlations, this 
system will give a unique solution. It should be noted that, from one 
point of view, only tests that form a one-factor system should be com¬ 
bined into a single score. Where several factors are present, several 
scores should result. 

The results of any one-factor solution to the weighting problem should 
still be inspected to determine the correlation between the composite 
and the individual tests to be certain that, from a judgmental point of 
view, there is nothing obviously peculiar or undesirable about the solu¬ 
tion. 

If no criterion is available and the battery of tests turns out
to have only one common factor, the tests may be weighted to
give the best prediction of this factor. If the battery is a multi-factor
one, it is possible arbitrarily to select the longest principal
axis or the first centroid axis as the best one to represent
the entire battery.

11. Weighting to equalize marginal contribution to total 
variance 

Wilks (1938) has suggested another method of defining and deter¬ 
mining “equal” weights. He points out that the variance of the weighted 
sum of K − 1 tests will be less than that of the K tests, and suggests
that in one sense all the tests are equally weighted if the variance of
any combination of K − 1 tests is equal. That is, the variance of the
total test would be equally affected no matter which one of the K 
constituent tests was removed from the composite. This method again 
is computationally complex and seems also to have little in its favor. 

12. Weighting to maximize the reliability of the composite 

If no external criterion is available, we may wish to assign weights 
so that the reliability of the weighted composite will be a maximum. 
The solution to this problem has been given for a special case by Mosier 
(1943), and for the general case by Thomson (1940) and Peel (1948). 
The solution given by Thomson can be shown to be equivalent to that 
given by Peel, and since that given by Peel is much simpler we shall 
use it. 

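Let the matrix of intercorrelations for the test battery be designated

$$
R = \begin{bmatrix}
1 & r_{12} & r_{13} & \cdots & r_{1K} \\
r_{12} & 1 & r_{23} & \cdots & r_{2K} \\
r_{13} & r_{23} & 1 & \cdots & r_{3K} \\
\vdots & \vdots & \vdots & & \vdots \\
r_{1K} & r_{2K} & r_{3K} & \cdots & 1
\end{bmatrix}
$$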

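Let the matrix of intercorrelations between the tests of the two parallel
batteries be designated

$$
C = \begin{bmatrix}
r_{11} & r_{12} & r_{13} & \cdots & r_{1K} \\
r_{12} & r_{22} & r_{23} & \cdots & r_{2K} \\
r_{13} & r_{23} & r_{33} & \cdots & r_{3K} \\
\vdots & \vdots & \vdots & & \vdots \\
r_{1K} & r_{2K} & r_{3K} & \cdots & r_{KK}
\end{bmatrix}
$$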

The off-diagonal entries of C are identical with the corresponding 
entries of R. The two matrices differ only in that one has reliability 
coefficients, and the other has unity in the diagonals. 

Let us also define the row vector of weights,

$$W = [\, W_1 \;\; W_2 \;\; W_3 \;\; \cdots \;\; W_K \,].$$

Since the second battery is assumed to be parallel to the first, the
matrix of intercorrelations (R) for this battery is assumed to be identical
with that for the first battery. Equation 7, the general expression for
the correlation of two weighted sums, is given in matrix notation by

$$R_{x_W x_W} = \frac{WCW'}{\sqrt{WRW'}\,\sqrt{WRW'}}.$$

Since we are dealing with a reliability coefficient, the two batteries of
tests and two sets of weights are identical. Thus the two factors in the
denominator are identical, giving

(74)   $R_{x_W x_W} = \frac{WCW'}{WRW'}.$

Since the reliability of the composite will remain the same if all the 
weights are multiplied or divided by any arbitrary factor, another con¬ 
dition is needed to determine the weights. We may say that the weights 
shall be chosen to make the variance of the composite unity, that is, 


(75) WRW' = 1. 

In order to select weights that maximize equation 74 subject to the
condition given in equation 75, we define a new function, using the
Lagrange multiplier ($\lambda$) as follows:

(76)   $R_{x_W x_W} = WCW' - \lambda(WRW' - 1).$

Differentiating equation 76 with respect to each of the $W$’s in turn,
setting each derivative equal to zero, and dividing by 2 gives the set of
equations

(77)   $WC - \lambda WR = 0.$

Postmultiplying both matrices by $W'$ and solving for $\lambda$ gives

(78)   $\lambda = \frac{WCW'}{WRW'} = R_{x_W x_W}.$

Since $\lambda$ is a scalar, it may occupy any position in a product. Thus the
solution of equation 77 for W gives

(79)   $W(C - \lambda R) = 0.$

Equations 79 have a solution other than W = 0 only if the determinant
of the coefficients of W equals zero; thus

(80)   $|C - \lambda R| = 0.$

Equation 80 is a Kth degree equation in $\lambda$. Since, from equation 78,
$\lambda = R_{x_W x_W}$ and we are seeking the maximum reliability, we choose $\lambda$ as
the largest root of equation 80, substitute this value in equation 79,
and solve for the relative weights. Using the condition given by equation
75 completes the solution for the weights that will maximize the
reliability of the weighted composite.
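
The steps from equation 77 to equation 80 amount to finding the largest root of $|C - \lambda R| = 0$ and the weight vector that goes with it. A minimal numerical sketch follows; it assumes that R is positive definite and that the reliabilities are known, and it replaces the determinantal equation by an equivalent symmetric eigenvalue problem. The function name, the correlations, and the reliabilities are hypothetical.

```python
import numpy as np

def max_reliability_weights(R, reliabilities):
    """Weights W maximizing WCW'/WRW', scaled so that WRW' = 1 (equation 75)."""
    R = np.asarray(R, dtype=float)
    C = R.copy()
    np.fill_diagonal(C, reliabilities)   # C differs from R only in its diagonal
    L = np.linalg.cholesky(R)            # R = L L'
    L_inv = np.linalg.inv(L)
    M = L_inv @ C @ L_inv.T              # symmetric matrix with the same roots
    roots, vecs = np.linalg.eigh(M)      # roots in ascending order
    W = L_inv.T @ vecs[:, -1]            # back-transform; gives WRW' = 1
    if W.sum() < 0:                      # the sign of the solution is arbitrary
        W = -W
    return W, roots[-1], C

R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
rel = [0.9, 0.7, 0.6]                    # hypothetical test reliabilities
W, lam, C = max_reliability_weights(R, rel)
print(np.round(W, 4), round(lam, 4))
```

The largest root is itself the reliability of the weighted composite, and it cannot fall below the reliability of the most reliable single test, since giving all the weight to that test is one of the admissible solutions.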

A simplified formula that gives a principal axis solution has been given 
by Green (1950a). 


13. The most predictable criterion 

By methods analogous to those used in equations 74 to 80, two 
different batteries may be weighted in such a fashion that the correlation 
between the two composites will be maximized. This is the problem 
of the “most predictable criterion” solved by Hotelling (1935, 1936). 

Let us define the following matrices: 


R is the matrix of intercorrelations of tests in the first battery. 

S is the matrix of intercorrelations of tests in the second battery. 

C is the matrix of correlations of the tests of the first battery with 
those of the second. 

V is the row vector of weights (V) to be used for the first battery.

W is the row vector of weights (W) to be used for the second battery.


Writing equation 7 for the correlation of two weighted sums in matrix 
notation, we have 


(81)   $R_{x_V x_W} = \frac{VCW'}{\sqrt{VRV'}\,\sqrt{WSW'}}.$

In order to avoid multiple solutions, some other restriction on the 
weights is necessary, such as adjusting the weights to make the variance 
of each composite unity. This corresponds to the restriction that 

(82) VRV' = WSW' = 1. 

Thus we may define a new function using two Lagrange multipliers,
$\lambda$ and $\gamma$, as

(83)   $\mathcal{R} = VCW' - \frac{\lambda}{2}(VRV' - 1) - \frac{\gamma}{2}(WSW' - 1).$

Differentiating with respect to the $V$’s and the $W$’s and setting the
derivatives equal to zero, we have

(84)   $CW' - \lambda RV' = 0$

and

(85)   $VC - \gamma WS = 0.$

Premultiplying the first equation by V and solving for $\lambda$, and postmultiplying
the second equation by W' and solving for $\gamma$, we find that

(86)   $\gamma = \lambda = \frac{VCW'}{VRV'} = \frac{VCW'}{WSW'} = R_{x_V x_W}.$

Postmultiplying both terms of equation 85 by $S^{-1}$ and solving for W,
we obtain

(87)   $W = \frac{1}{\gamma}\, VCS^{-1}.$

Substituting equation 87 in the transpose of equation 84, factoring out
V, and writing $\lambda^2$ for the product $\gamma\lambda$, we have

(88)   $V(CS^{-1}C' - \lambda^2 R) = 0.$

By a corresponding procedure we find the solution for W as

(89)   $W(C'R^{-1}C - \lambda^2 S) = 0.$

Equations 88 and 89 have a solution for the weights other than zero
only if the determinant of the coefficients of V and of W equals zero.
We may write

(90)   $|CS^{-1}C' - \lambda^2 R| = 0$   and   $|C'R^{-1}C - \lambda^2 S| = 0.$

Equations 90 are polynomials in $\lambda^2$. Since equation 86 shows
$\lambda^2 = R^2_{x_V x_W}$ and we are seeking the maximum correlation between
the weighted composites, we choose the largest root of equation 90, and
use this value in equations 88 and 89 and the conditions given by
equation 82. This procedure completes the solution for the weights
that will maximize the correlation between the two weighted composites.
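
The solution of equations 88 to 90 can be sketched numerically. The sketch below assumes R and S are positive definite and obtains the largest root through a singular value decomposition of a transformed C, which is algebraically equivalent to solving the determinantal equations; the function name and the simulated scores are hypothetical and serve only to supply a consistent set of correlations.

```python
import numpy as np

def most_predictable_criterion(R, S, C):
    """Weights V, W maximizing equation 81, scaled so VRV' = WSW' = 1."""
    Lr = np.linalg.cholesky(R)                          # R = Lr Lr'
    Ls = np.linalg.cholesky(S)                          # S = Ls Ls'
    M = np.linalg.solve(Lr, C) @ np.linalg.inv(Ls).T    # Lr^{-1} C Ls'^{-1}
    U, sing, Vt = np.linalg.svd(M)                      # largest value first
    V = np.linalg.solve(Lr.T, U[:, 0])                  # weights, first battery
    W = np.linalg.solve(Ls.T, Vt[0])                    # weights, second battery
    return V, W, sing[0]                                # sing[0] is lambda

# Simulated scores: three predictor tests and two criterion measures that
# share one latent ability (an assumption made only to get valid matrices).
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 1))
scores = latent + rng.standard_normal((500, 5))
corr = np.corrcoef(scores, rowvar=False)
R, S, C = corr[:3, :3], corr[3:, 3:], corr[:3, 3:]

V, W, lam = most_predictable_criterion(R, S, C)
direct = (V @ C @ W) / np.sqrt((V @ R @ V) * (W @ S @ W))   # equation 81
print(np.round(V, 3), np.round(W, 3), round(lam, 4))
```

Inspecting V and W after such a computation is the practical form of the caution given later in this section: weights near zero, or negative weights, show which measures the maximization has effectively discarded.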

Setting C = C', R = S, and W = V in equations 88 and 89 gives 
the solution for maximizing battery reliability given by Thomson (1940). 
Such a solution of the problem of maximizing battery reliability is 
equivalent to the much simpler solution of equation 79. The following 
procedure for demonstrating this equivalence was suggested by Dr. 
L. R. Tucker of the Educational Testing Service. 

Postmultiplying both terms of equation 77 by $R^{-1}C$ gives

$$WCR^{-1}C = \lambda WC.$$

Multiplying both sides of equation 77 by $\lambda$ gives

$$\lambda WC = \lambda^2 WR.$$

Thus we have the solution derived from simplifying equations 88 and 89,

$$W(CR^{-1}C - \lambda^2 R) = 0,$$

where $\lambda^2$ equals $R^2_{x_W x_W}$. Since the two solutions are equivalent, the
simpler one indicated in equation 79 is to be preferred.

The weights given by equations 88 and 89 constitute a mathematically 
very elegant method of weighting to secure a maximum correlation. 
However, it should be noted that, unless used with discretion, the 
procedure of determining the most predictable criterion has certain 
dangers. For example, if one of the criterion measures and one of the 
predictor measures happen to be tests of the same factor, both batteries 
are likely to be weighted to correspond with that factor. That is, if 
one test of spatial visualization is used in the prediction battery, and 
another test of spatial visualization in the criterion, while verbal and 
quantitative factors are represented in only one of the batteries, the 
most predictable criterion procedure is likely to result in warping the 
criterion to represent primarily spatial visualization. The blind accept¬ 
ance of such a result would mean that there would be no effort to repre¬ 
sent the other factors, such as, for example, verbal and quantitative in 
the predicting battery. In an extreme case such a procedure could mean 
that all factors in the criterion that were initially omitted from the 
prediction battery would always be omitted since they would receive 
very little weight in the criterion. 

The remedy here, as in all other uses of mathematical procedures, 
is to inspect the results to see if they have any peculiar characteristics. 
Any set of tests that have low weight for the criterion should be inspected 
to see if they would be regarded by experts in the field as being important 
and deserving of an important place in determining the total criterion. 
The prediction battery should be inspected to see if an attempt has 
been made to include the type of ability required by the criterion vari¬ 
ables that received low weight in determining the composite. In general 
we should alter the variables entering into the two batteries if the results 
do not seem to be appropriate. 

Another way of stating the caution given in the foregoing paragraphs 
is to say that a mathematical method should be adopted only to choose 
between alternatives that are judgmentally very similar. For example, 
if all the criterion variables are indifferent so that the expert would 
accept any set of weights positive or negative, the most predictable 
criterion results need not be questioned. However, if the expert judgment
is that any set of positive weights is acceptable, it would seem
proper to use the most predictable criterion only if all weights were
positive. The expert judgment might be even more restrictive. For
example, if the criterion measures were grades and tests in college, and
if these measures appeared to involve three types of abilities—verbal, 
quantitative, and spatial—the faculty might judge that any composite 
was acceptable as long as the verbal and quantitative factors had weights 
distinctly higher than those of the spatial factor. Given this judgment, 
we could use the most predictable criterion if the weights fell in the area 
indicated. Otherwise the problem for the technician would be to alter 
the criterion or predictor variables in some reasonable and acceptable 
fashion so that the weights would have reasonably appropriate relative 
values. In general, we may say that mathematical procedures are 
appropriately used when they serve to guide thought. If an attempt 
is made to utilize such routines as a substitute for thought, we may 
unwittingly arrive at and accept absurd conclusions. 

14. Differential tests 

Sometimes when a battery of tests is given, the problem is to obtain 
a single score from the battery. This implies that the battery is a one- 
factor battery or is to be treated as a one-factor battery. Sometimes 
it is desired to obtain several different scores from the battery. This 
implies that the battery represents several different factors, and a score 
is to be obtained for each of the factors. In this latter case the best 
procedure is to determine the factors present in the battery, and then 
to use the scores that best predict the factor scores of each individual. 

It may happen when this procedure is followed that the factor scores 
finally obtained still intercorrelate rather high so that, instead of having 
a set of differential scores, we have scores that in large part give different 
patterns of ability only through incidental errors of measurement. 
Whenever a set of supposedly differential scores is set up by factor 
analysis or other methods, it is desirable to make a check on the scores 
finally proposed to determine the extent to which such scores will give 
valid differentiation of different scores for the same individual. 

When the accomplishment quotient (A.Q.) was introduced, Kelley 
pointed out that the problem involved was to obtain reliable measures 
of each variable. Clearly these measures would be correlated, so that 
the accomplishment quotient might reflect only errors of measurement. 

Kelley (1923a) proposed a method for testing the extent to which two 
tests are giving differential scores to a set of persons. This method makes 
use of both test reliability and intercorrelation to determine the per¬ 
centage of scores that will show a reasonable difference and the 
percentage that will show such a difference solely through errors of 
measurement. The first percentage should be considerably larger than 
the second if the scores are to be used for their differential value.

If we use x and y to designate deviation scores in the two tests under
consideration and the subscript t to designate the true scores in these
variables, the difference between the observed score difference and the
true score difference constitutes the error with which a difference in
scores is measured.

(91)   $e_d = (x - y) - (x_t - y_t).$

By rearranging terms, the error of the difference may be written in
terms of the error in x and in y, as follows:

(92)   $e_d = (x - x_t) - (y - y_t) = e_x - e_y.$

Since the errors in x are independent of the errors in y, the sum of
squares may be written

(93)   $\sum e_d^2 = \sum e_x^2 + \sum e_y^2.$

Expressing the error in x and y in terms of the test reliability and
standard deviation, we have

(94)   $s_{e_d}^2 = \sum e_d^2 / N = s_x^2(1 - r_{xx}) + s_y^2(1 - r_{yy}).$

The magnitude of this term, which is the variability of differences due to
error, may be compared with the total variability of differences.


(95)   $\frac{\sum (x - y)^2}{N} = \frac{\sum x^2}{N} + \frac{\sum y^2}{N} - \frac{2\sum xy}{N}.$

From the equations for variance and correlation, we have

(96)   $s_{x-y}^2 = s_x^2 + s_y^2 - 2 r_{xy} s_x s_y.$

If the tests x and y are expressed in standard scores, the standard deviations
and variances become unity, and equations 94 and 96 become,
respectively,

(97)   $s_{e_d}^2 = 2 - r_{xx} - r_{yy} = 2(1 - \bar{r})$

and

(98)   $s_{x-y}^2 = 2(1 - r_{xy}),$

where $\bar{r}$ is the average reliability of the two tests.

As the average reliability becomes markedly larger than the inter¬ 
correlation of the tests, the dispersion of obtained differences becomes 
greater than that obtained by chance. 
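
The comparison of error variance with total variance of the differences (equations 94 and 96) can be carried out for raw scores as well as standard scores. A minimal sketch follows; the function name and the numerical values are hypothetical.

```python
def difference_variances(s_x, s_y, r_xx, r_yy, r_xy):
    """Error variance (equation 94) and total variance (equation 96) of x - y."""
    error_var = s_x**2 * (1.0 - r_xx) + s_y**2 * (1.0 - r_yy)
    total_var = s_x**2 + s_y**2 - 2.0 * r_xy * s_x * s_y
    return error_var, total_var

# Standard scores, reliabilities .9 and .8, intercorrelation .5.
err_a, tot_a = difference_variances(1.0, 1.0, 0.9, 0.8, 0.5)

# Standard scores, average reliability equal to the intercorrelation.
err_b, tot_b = difference_variances(1.0, 1.0, 0.6, 0.6, 0.6)

print(err_a / tot_a, err_b / tot_b)
```

In the first case less than a third of the variance of the differences is error; in the second, where the average reliability is no larger than the intercorrelation, the whole of it is error.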

Kelley (1923a) proposed using normal curve proportions derived
from equations 97 and 98 to find the percentage of observed differences
in excess of that which could be expected to occur by chance because of
the error of measurement.

To put this material in more familiar form, equation 21, Chapter 17, 
shows us the reliability expressed as a function of the error variance 
and total variance. Equations 97 and 98 in this chapter and 21 of 
Chapter 17 give 


(99)   $r_{x-y} = 1 - \frac{2(1 - \bar{r})}{2(1 - r_{xy})}.$

Simplifying, we have

(100)   $r_{x-y} = \frac{\bar{r} - r_{xy}}{1 - r_{xy}},$

where $r_{x-y}$ is the reliability of the difference between x and y,
$r_{xy}$ is the correlation between tests x and y, and
$\bar{r}$ is one-half the sum of the reliabilities of tests x and y.


A similar equation is given by Conrad (1944b), page 7.

For any pair of tests, equation 100 gives the reliability of the
difference as a function of the intercorrelation of the two tests
($r_{xy}$) and the average of the two reliability coefficients ($\bar{r}$).


Figure 3 is a linear graph showing the nature of this relationship.
To use this computing diagram, mark the diagonal line corresponding to
the intercorrelation of the two tests ($r_{xy}$). Locate the average reliability
at the bottom of the chart, then move up to the diagonal line and over
to the scale at the right showing the reliability of the difference. In the
illustrative problem (shown by the heavy dashed line in Figure 3), if
the average reliability is .6 and the intercorrelation is .5, the reliability
of the difference is only .2. It should be noted that, if the average
reliability of the two tests is about the same as their intercorrelation,
the reliability of the difference is approximately zero. In order for the
reliability of the difference to be .8, when $r_{xy}$ is .5, the average reliability
of the tests must be .9.
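
The readings taken from the computing diagram can be reproduced directly from equation 100. The sketch below gives the equation and its inverse (solved for the average reliability); the function names are hypothetical.

```python
def reliability_of_difference(avg_reliability, r_xy):
    """Equation 100: reliability of the difference between two tests."""
    return (avg_reliability - r_xy) / (1.0 - r_xy)

def required_average_reliability(target, r_xy):
    """Equation 100 solved for the average reliability needed to reach `target`."""
    return r_xy + target * (1.0 - r_xy)

# The two cases worked in the text, both with an intercorrelation of .5.
print(reliability_of_difference(0.6, 0.5))      # about .2
print(required_average_reliability(0.8, 0.5))   # about .9
```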

In setting up a profile type of battery in which differences in a given
person’s scores on different tests are important, we must be certain that the
reliability of the difference in scores is fairly high before giving this 
difference much weight in the interpretation. Kelley (1923a) suggested 
that this reliability figure be interpreted in terms of the percentage of 
observed difference scores in excess of that which could be expected to 
occur by chance because of the errors of measurement. If this percentage 
were very small, the difference score would not be very useful. The 
interpretation in terms of reliability is more conventional; also the error
of measurement can be obtained so that differences less than one or
two errors of measurement will not be interpreted as indicating a real
difference in ability in the two tests.

Figure 3. Showing the relationship between the average reliability of two tests,
their intercorrelation, and the reliability of the difference between the two tests.
Equation (100): $r_{x-y} = \frac{\bar{r} - r_{xy}}{1 - r_{xy}}$.

Many profile tests furnish no information on the reliability of the 
differences in scores. Such information should be required as a routine 
part of the validation and standardization of any battery to be used 
as a profile. 

The Differential Aptitude Battery of the Psychological Corporation 
has been set up in this manner; see Bennett (1947) or Bennett and 
Doppelt (1948).

Brogden (1946a) has presented a method for determining cutting
scores in connection with the problem of differential prediction. A 
more complete analysis of the problems of differential prediction is 
given by Tucker (1948), Thorndike (1949), and Mollenkopf (1950). 


15. Summary 

Whenever a single total score is to be derived from a number of sepa¬ 
rate scores, the weighting problem cannot be avoided. However, if 
many different scores with reasonably high intercorrelations are being 
combined, the resulting composite will be fairly similar for a large 
variety of weights. If, however, relatively few items are to be combined,
there is a low correlation among these items, the standard deviation of
the distribution of weights being considered is fairly large, and the
correlation between two sets of weights is low, the two resulting composites
will be different. The correlation between two composites
obtained by using different weights on the same set of scores is given by 


Equation (15)

$$
R_{x_V x_W} = \frac{K(1 - \bar{r})(C_{VW} + \bar{V}\bar{W}) + (K^2 - K)\,C_{(VW)r} + K^2 \bar{V}\bar{W}\bar{r}}
{\sqrt{K(1 - \bar{r})(\tilde{V}^2 + \bar{V}^2) + (K^2 - K)\,C_{(VV)r} + K^2 \bar{V}^2 \bar{r}}\;
 \sqrt{K(1 - \bar{r})(\tilde{W}^2 + \bar{W}^2) + (K^2 - K)\,C_{(WW)r} + K^2 \bar{W}^2 \bar{r}}},
$$

where $K$ is the number of scores to be combined,

$\bar{W}$ and $\bar{V}$ are the averages of the W and V weights,

$\tilde{W}$ and $\tilde{V}$ are the standard deviations of the two sets of weights,

$\bar{r}$ is the average intercorrelation of the scores to be combined,

$C_{VW}$ is the covariance between the two sets of weights used,

$C_{(VW)r}$, $C_{(VV)r}$, and $C_{(WW)r}$ represent the covariance of a product of
weights with the corresponding correlation $r_{gh}$, and

$R_{x_V x_W}$ is the correlation between the composite obtained using the
V-weights and that obtained using the W-weights.


Equation 15 shows us that, if $\bar{r}$, $\bar{V}$, and $\bar{W}$ are positive, the correlation
$R_{x_V x_W}$ approaches unity: (a) as the correlation between the two sets
of weights is increased or (b) as the standard deviation of the weights
($\tilde{V}$, $\tilde{W}$) is decreased in proportion to the mean weights ($\bar{V}$ and $\bar{W}$).
Also, if the covariance terms of the form $C_{(VW)r}$ may be ignored,
$R_{x_V x_W}$ approaches unity: (c) as K approaches infinity or (d) as $\bar{r}$ approaches
unity.
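
Equation 15 can be checked against a direct computation of the correlation between two weighted composites of standard scores. The sketch below assumes that the means, standard deviations, and covariances of the weights are computed with divisor K, that the average intercorrelation is taken over the off-diagonal entries, and that the covariance of a product of weights with $r_{gh}$ is likewise taken over the $K^2 - K$ off-diagonal pairs. The function names, weights, and correlations are hypothetical.

```python
import numpy as np

def composite_corr_direct(V, W, R):
    """Correlation of two weighted sums of standard scores, computed directly."""
    return (V @ R @ W) / np.sqrt((V @ R @ V) * (W @ R @ W))

def _eq15_term(A, B, R):
    """One numerator-type term of equation 15 for weight vectors A and B."""
    K = len(A)
    off = ~np.eye(K, dtype=bool)                  # mask of off-diagonal pairs
    r_bar = R[off].mean()                         # average intercorrelation
    prod = np.outer(A, B)[off]                    # products A_g * B_h, g != h
    c_prod_r = np.mean((prod - prod.mean()) * (R[off] - r_bar))
    c_ab = np.mean(A * B) - A.mean() * B.mean()   # covariance of the weights
    return (K * (1.0 - r_bar) * (c_ab + A.mean() * B.mean())
            + (K**2 - K) * c_prod_r
            + K**2 * A.mean() * B.mean() * r_bar)

def composite_corr_eq15(V, W, R):
    """Equation 15, assembled from the summary statistics of the weights."""
    return _eq15_term(V, W, R) / np.sqrt(_eq15_term(V, V, R) * _eq15_term(W, W, R))

a = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(a, a)
np.fill_diagonal(R, 1.0)                          # hypothetical correlation matrix
V = np.array([1.0, 2.0, 3.0, 4.0])
W = np.array([4.0, 1.0, 2.0, 3.0])

print(composite_corr_direct(V, W, R), composite_corr_eq15(V, W, R))
```

Under these definitions the two computations agree exactly, so equation 15 is an identity; its value lies in showing which summary features of the weights control the result.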

Some appraisal of the magnitude of $R_{x_V x_W}$ for positive weights may
be obtained from

where H is an approximation to the mean value of the correlation be¬ 
tween two weighted composites, and the other terms have the defini¬ 
tions indicated for equation 15. 

Equations 15 and 47 show us that, if a few scores with low intercorrelations
are involved, and if also we are considering alternative sets of
weights with large variance and low intercorrelation, weighting will 
make an appreciable difference, and the selection of a “best” set of 
weights is important. For such a case we may say: 

1. If a criterion is available, the multiple correlation weights indicated 
by equations 52 and 53 for the general case and by equation 56 for the 
three-variable case will give the best results, in the sense that the 
correlation between the weighted composite and the criterion will be a 
maximum (see equation 55 for the general case and 58 for the three- 
variable case). For practical purposes, simple integral approximations 
to the exact multiple weights will usually give a satisfactory composite 
score. 

2. If many variables are involved, and particularly if a selection is to 
be made among these variables, some approximation to multiple correla¬ 
tion as indicated in section 3 is to be preferred to the exact method. 

3. Where no criterion is available, various weighting methods have 
been adopted: 

(a) Weighting in terms of the average score or the perfect score, which
is usually equal to the number of items in an objective test, is
always to be avoided. There is no justification for the belief that
these factors have or should have any effect on the weight of an
item in a composite.

(b) Weighting inversely as the standard deviation ($1/s_x$) or inversely
as the error variance, that is, by the factor $r/(1 - r)$, has been
suggested in the literature on testing. Both these methods depend
on a rationale involving assumptions that are probably never
satisfied in practice. Also there are many situations in which
either method will give results clearly inappropriate.

(c) Weighting to equalize “marginal” contributions to total variance
has been suggested (see section 11). This method also has peculiar
properties.

(d) Weighting inversely as the error of measurement was discussed
in section 6. No rationale is at present available for this method,
and it seems not to have been used or suggested before. From a
common-sense point of view, this method has a valuable set of
properties; and, of the different rule-of-thumb methods presented
in this chapter, it is the one that would seem to be most generally
acceptable.

4. Where no criterion is available, it is probably best not to use any
rule-of-thumb method. Two alternatives are suggested here.

(a) We may weight the items so as to maximize the reliability of the 
composite by using the matrix formula 

(79) $$W(C - \lambda R) = 0,$$

where R is the matrix of intercorrelations among the variables (with
unity in the diagonals),

C is the same matrix with the substitution of reliabilities for unity
in the diagonals,

W is the vector of weights, and
$\lambda$ is chosen as the largest root of

(80) $$|C - \lambda R| = 0.$$

(b) We may depend on expert judgment for the determination of
weights. For a system of correlated variables there is no
satisfactory method of assessing the proportional contribution of each
component to the total. The best guide is the correlation of each
part with the composite. The correlation of any part (g) with the
composite (C) is given by

(71) $$r_{gC} = \frac{W_g s_g + \sum_{h=1}^{K} W_h r_{gh} s_h}{s_C} \qquad (g \neq h),$$

where $r_{gC}$ is the correlation between part g and the weighted
composite (C),

$W_g$ (or $W_h$) is the weight assigned to any part (g or h),
$s_g$ (or $s_h$) is the standard deviation of that part,
$r_{gh}$ is the correlation between two parts, and
$s_C$ is the standard deviation of the composite.
In equation 71 the correlation between part g and the composite is
shown to depend entirely upon (a) the weight of that part ($W_g$), (b) the
standard deviation of that part ($s_g$), and (c) the sum of products
$W_h r_{gh} s_h$ for all other parts entering into the composite.

In judging what weights are to be assigned to the different parts, we
must understand clearly that these are the only factors determining the
“weight” of the part in the composite, and that the correlation between
the part and the composite is the best criterion to use in judging the
effective weights of each part in the composite. If the judges will assign
certain relative values to the correlations $r_{gC}$, the solution of equation 72
for $W_h$ will give the relative weights for the different parts.

5. If multiple criterion and predictor measures are available, we may 
select weights so that the correlation between the two composites will 
be maximized. These weights are given by the matrix equations, 

(88) $$V(CS^{-1}C' - \lambda^2 R) = 0$$
and

(89) $$W(C'R^{-1}C - \lambda^2 S) = 0,$$

where R and S are the matrices of intercorrelations among the variables
of each set,

C is the matrix of correlations of variables in one set with
those in the other set,

V is the set of weights to be applied to the variables of the
R-matrix,

W is the set of weights to be applied to the variables of the
S-matrix, and

$\lambda^2$ is chosen as the largest root of

(90) $$\begin{aligned} |CS^{-1}C' - \lambda^2 R| &= 0 \\ |C'R^{-1}C - \lambda^2 S| &= 0. \end{aligned}$$

The weights given by this system are so flexible that we may easily 
be led to a composite criterion that is undesirable from a judgmental 
point of view. When using “the most predictable criterion” it is neces¬ 
sary to inspect the weights carefully from the point of view of the expert 
judge in order to avoid accepting an unreasonable criterion. 

6. It has been suggested that the methods of factor analysis may aid
in solving the weighting problem. Where a single score is to be
determined, it has been suggested:

(a) That the first principal axis be used as the best representative
of the set of scores. This method has a number of very
interesting properties. It maximizes the interindividual differences,
minimizes the differences between the various scores obtained by
a given person, and minimizes the “generalized variance” for all
individuals receiving the same score. Despite such a set of
properties, however, it has two serious disadvantages. It is
laborious to compute, since it necessitates a successive
approximations procedure, and it is sensitive to the units in which the
various tests are measured.

(b) That the first centroid axis be used. It may be a good possibility
if some one score is desired to represent a set of scores.

(c) That, if the set of scores is actually a one-factor system, the one
common factor would seem to be a very good choice for the
composite score. Spearman suggested this solution, and gave the
equations for it. However, a set of tests must be very carefully
selected if it is to contain only one factor. This rule, therefore,
could be applied in only a very few cases.
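
The maximum-reliability weights of item 4(a) (equations 79 and 80) amount to a generalized characteristic-root problem. A minimal sketch follows; it is an illustration rather than the author's computing routine. It assumes symmetric matrices, so that W(C − λR) = 0 can be solved as R⁻¹C w = λw, and it takes w C w′ / w R w′ as the reliability of the standard-score composite. The intercorrelations and reliabilities are hypothetical.

```python
import numpy as np

def reliability_matrix(R, reliabilities):
    """Return C: the intercorrelation matrix with reliabilities in the diagonal."""
    C = np.array(R, dtype=float)
    np.fill_diagonal(C, reliabilities)
    return C

def composite_reliability(R, reliabilities, w):
    """Ratio w C w' / w R w' for a weighted composite of standard scores."""
    R = np.asarray(R, float)
    w = np.asarray(w, float)
    C = reliability_matrix(R, reliabilities)
    return (w @ C @ w) / (w @ R @ w)

def max_reliability_weights(R, reliabilities):
    """Weights belonging to the largest root of |C - lambda R| = 0."""
    R = np.asarray(R, float)
    C = reliability_matrix(R, reliabilities)
    # For symmetric C and R, W(C - lambda R) = 0 is equivalent to R^{-1} C w = lambda w.
    roots, vectors = np.linalg.eig(np.linalg.solve(R, C))
    top = int(np.argmax(roots.real))
    w = vectors[:, top].real
    w = w / np.linalg.norm(w)
    if w[np.argmax(np.abs(w))] < 0:     # fix the arbitrary sign of the vector
        w = -w
    return w, float(roots[top].real)

# Hypothetical three-test battery.
R_demo = np.array([[1.0, 0.3, 0.4],
                   [0.3, 1.0, 0.5],
                   [0.4, 0.5, 1.0]])
rel_demo = [0.6, 0.8, 0.9]

w_opt, lam = max_reliability_weights(R_demo, rel_demo)
equal = composite_reliability(R_demo, rel_demo, np.ones(3))
print(w_opt, lam, equal)
```

Under these assumptions the largest root is itself the reliability of the optimally weighted composite, so it cannot fall below the reliability obtained with equal weights or below the reliability of the best single test.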

If several scores are to be derived from a set of tests, the best procedure 
would be a factor analysis procedure. The battery should be factored, 
and a score assigned for each principal factor that is determined. It 
should also be noted that, whenever several different scores are assigned 
to each person in a group, and differential use is made of these scores, it 
is necessary to assess the reliability of the score differences. This
reliability is given by

(100) $$r_{x-y} = \frac{\bar{r} - r_{xy}}{1 - r_{xy}},$$

where $r_{x-y}$ is the reliability of the difference score,

$r_{xy}$ is the correlation between the two tests, and
$\bar{r}$ is half the sum of the two reliabilities.

A computing diagram for this equation is given in Figure 3. Equation 
100 shows that, unless the average reliability of two tests is considerably 
higher than the correlation between them, the differences will be very 
unreliable. This means that in making differential predictions, or in 
interpreting profiles, judgments will usually be made on the basis of 
accidental score differences. Unless $r_{x-y}$ is .80 or larger, valid judgments
of individuals cannot be made on the basis of score differences between
tests x and y. All differential prediction batteries or batteries that are
to be used as profiles should give information on the reliability of the 
difference score for each pair of tests in the battery. 
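
As a brief illustration of equation 100, the sketch below computes the reliability of a difference score and, by rearranging the same equation, the average reliability needed to reach a target value. The function names and the numerical values are hypothetical and are not taken from the text.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Equation 100: reliability of the difference between tests x and y."""
    if r_xy >= 1.0:
        raise ValueError("r_xy must be less than 1")
    r_bar = (r_xx + r_yy) / 2.0          # half the sum of the two reliabilities
    return (r_bar - r_xy) / (1.0 - r_xy)

def required_average_reliability(target, r_xy):
    """Equation 100 solved for r_bar, given a target difference reliability."""
    return target * (1.0 - r_xy) + r_xy

# Two tests with reliability .90 each that correlate .70 with each other.
low = difference_reliability(0.90, 0.90, 0.70)

# The same two reliabilities when the tests correlate only .50.
high = difference_reliability(0.90, 0.90, 0.50)

# Average reliability needed for a difference reliability of .80 when r_xy = .60.
needed = required_average_reliability(0.80, 0.60)
print(low, high, needed)
```

With these hypothetical figures, lowering the intercorrelation from .70 to .50 raises the difference reliability from about .67 to .80, which is the point the paragraph above makes about highly correlated tests.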
Problems

1. What is the expected value of the correlation between any two composites
from a battery of forty tests with an average intercorrelation of .30? Assume that
only positive weights are used and that the average weight is about equal to the 
standard deviation of the weights. 

2. On the foregoing assumption regarding weights, what is the expected value 
of the correlation between any two composites from a battery of five tests with an 
average intercorrelation of .20? 


Data for Problems 3 to 7 


Entering freshmen at the University of Chicago are given an A. C. E.
Psychological Examination (a), a physical sciences aptitude test (s), and an English
placement test (e). A year later they are given the physical science comprehensive
(p) and the humanities comprehensive (h).

The following zero-order correlations are obtained:

$r_{as}$ = .50,    $r_{ae}$ = .70,    $r_{se}$ = .40,
$r_{ap}$ = .50,    $r_{sp}$ = .70,    $r_{ep}$ = (value illegible),
$r_{ah}$ = .60,    $r_{sh}$ = .20,    $r_{eh}$ = .70.

The following means and standard deviations are found: 



                        a      s      e      p      h
Mean                   120    110    150    220    460
Standard deviation      30     20     25     30     40



3. Write the equation for making the best prediction of the humanities compre¬ 
hensive score from the three placement tests. 

4. What will be the correlation between the predicted humanities scores and the 
actual scores, using the prediction equation given in 3? 

5. Which two placement tests will give the best prediction of scores in the
physical-science comprehensive?

6. Write the equation for making the best prediction of the physical-science
comprehensive score from the two tests mentioned in 5.

7. What is the correlation between the actual physical-science scores and the 
scores predicted by using the equation given in 6? 
Data for Problems 8 to 17

          Number                Std.     Formula for                   Intercor-
          of Items    Mean      Dev.     Transformed    Reliability    relations
            (K)       (X̄)      (s_x)     Scores (Y)       r_xx
Test a       10        6.1       1.3      Y = 10X          .62         r_ab = .36
Test b       50       35.6       4.2      Y = X            .83         r_ac = .42
Test c      200      153.7      15.8      Y = X/2          .95         r_bc = .65

$$z = \frac{X - \bar{X}}{s_x}$$

X, Y, and z scores for the following problems are defined in the foregoing
table.

8. For the data given in the table, discuss the desirable and undesirable
characteristics of a composite score formed by weighting $X_a$, $X_b$, and $X_c$ according to the
reliability of each test, as indicated in equation 66. Would the composite be the same
or different if the Y-scores or the z-scores were weighted as indicated in equations
65 or 66?

9. Give the desirable and undesirable characteristics of a composite formed by
weighting the X-scores inversely as the standard deviation of the test. Would the
composite be the same, or different, if the Y-scores or the z-scores were weighted
according to the same principle?

10. Give the desirable and undesirable characteristics of a composite formed by
weighting the X-scores, Y-scores, and z-scores inversely as the error of measurement.

11. Give the desirable and undesirable characteristics of a composite formed by
weighting the X-scores, Y-scores, and z-scores inversely as the error variance.

12. Give the desirable and undesirable characteristics of a composite formed by
weighting the X-scores, Y-scores, and z-scores directly as K, the number of items in
the test.

13. Give the desirable and undesirable characteristics of a composite formed by
weighting the X-scores, Y-scores, and z-scores inversely as K.

14. Give the desirable and undesirable characteristics of a composite formed by
weighting the X-scores, Y-scores, and z-scores inversely as the test mean.

15. Judging in terms of the correlation of the part with the composite (equations
71 or 72), how would a, b, and c be weighted by (a) adding X-scores; (b) adding
Y-scores; (c) adding z-scores; (d) taking the composite, $T = 10X_a + 2X_b + X_c$?
16. What weighting factors should be assigned to the X-scores in order to obtain
a composite t, such that $r_{at}$, $r_{bt}$, and $r_{ct}$ are approximately in proportion to 2, 3, and 4,
respectively (see equation 72)?

17. What is the reliability (a) of the difference score $X_a - X_b$; (b) of the difference
score $X_a - X_c$; (c) of the difference score $X_b - X_c$?

18. Prove that gross score weights that are inversely proportional to the error
variance for gross scores are identical with the weights of equations 65 and 66.

19. (a) Find the standard score weights that are inversely proportional to the
error variance of a standard score. Compare this weight with that of equation 65.

(b) Determine the gross score weights that will give results identical with these
standard score weights, and compare these gross score weights with those of equation
66.



21 

Item Analysis 


1. Introduction * 

Basically, item analysis is concerned with the problem of selecting 
items for a test so that the resulting test will have certain specified 
characteristics. For example, we may wish to construct a test that is 
easy or one that is difficult. In either case it is desirable to develop a
test that will correlate as high as possible with certain specified criteria 
and will have a satisfactory reliability. The index of skewness should 
be positive, negative, or zero for a specified population. If a battery 
of several tests is being constructed, it may be desirable to have the 
intercorrelations as low as possible. It is also of considerable interest 
to be able to construct a test so that the error of measurement is a 
minimum for a specified ability range or so that the error of measurement 
is constant over a wide ability range, as is assumed in the development 
of formulas for variation in reliability with variation in heterogeneity 
of the population (see Chapters 10, 11, and 12). In each of these
situations it would be convenient to be able to write the prescription for item
selection so that we should be able to subject a set of K items to an 
appropriate type of analysis, and then to select the subset of k items 
that would come nearest to satisfying the desired characteristics. 

As yet the rationale of item analysis has been developed for only a 
few of the problems indicated. Numerous arbitrary indices have been 
devised and used. Twenty-three methods are listed and described by 
Long and Sandiford (1935). Nineteen methods are summarized by 
Guilford (1936b) in Psychometric Methods, pages 426-456. With one or
two exceptions, these lists are essentially the same. For earlier surveys 
of item analysis methods, see Cook (1932) or Lentz, Hirshstein, and 
Finch (1932). The striking characteristic of nearly all the methods 
described is that no theory is presented showing the relationship between 
the validity or reliability of the total test and the method of item 
analysis suggested. The exceptions, which show a definite relationship 
between the item selection procedure and some important parameter 
of the test, are Richardson (1936a); the method of successive residuals,
Horst (1934b); the use of a maximizing function, Horst (1936b); the
L-method and its various modifications (see Toops, 1941, Adkins and
Toops, 1937, and Richardson and Adkins, 1938).

In developing and investigating procedures of item analysis, it would 
seem appropriate, first, to establish the relationship between certain 
item parameters and the parameters of the total test; next, to consider 
the problem of obtaining the item parameters in such a way that they 
will, if possible, not change with changes in the ability level of the 
validating group; and, last, to consider the most efficient methods, from 
both a mathematical and a computational viewpoint, of estimating these 
parameters for the items. 

The method of item selection used and the theory on which it is based 
must be directly related to the method of test scoring. For the usual 
aptitude or achievement test, the responses to each item may be classified 
as either correct or incorrect, and the item analysis procedures utilize 
this information. For items relating to personality, interests, attitudes, 
or biographical facts, the responses cannot be classified as either correct 
or incorrect. A set of such items demands a more complex type of 
item analysis procedure that not only gives information on item selection 
but also furnishes a scoring key. If an achievement or aptitude test is 
scored in terms of “level reached,” it would seem appropriate to use the 
item analysis methods of absolute scaling (see Thurstone, 1925 and
1927b) or some other analogous scaling method. Such procedures do
not seem appropriate for the usual test that is scored by counting the 
number of correct responses. In this chapter we shall consider only 
those item analysis procedures suitable for the case in which the item 
responses may be classified as correct or incorrect and in which the score 
is the number of correct responses. 

Another consideration that will affect item analysis methods is the 
extent to which the group available for item analysis purposes is similar 
to or different from the prospective test group. For example, a group 
of students in a college with high admission standards might be the only 
group available for experimental purposes for a test that is to be generally 
used for college admission. In this case item information from a group 
of high ability is to be used in constructing a test to be used for a group 
with a lower average ability and a larger variance in ability. Other 
variants of this problem may arise. For example, considerable item 
analysis data may be available on a large population of applicants for 
college admission, and we may wish to use this information in selecting 
items suitable for a scholarship examination that is to be taken only 
by superior students. The item selection problem is clearly much 
simpler when the item analysis group and the prospective test group
are similar in mean and variance on the particular ability to be tested. 

It is important to note that, while the item analysis rationale and the 
quantitative item selection procedures are the same for aptitude and 
achievement tests, there is one important difference. In the construction 
of aptitude tests the item statistics may be allowed to control the 
rejection and selection of items more fully than in the construction of 
achievement tests. The judgment of the subject matter expert must 
always play an important part in the selection and rejection of items 
for an achievement test. If the item analysis results show that a given 
item should be used, and the expert finds that the item is incorrect,
that item must be revised. If the item analysis results show that an 
item should be deleted, and the subject matter expert feels that essential 
knowledge is being tested in the item, then attempts must be made to 
discover the flaw in item construction and revise the item so that it 
will satisfy both the item analysis criteria and the judgment of the 
subject matter specialist. It may even be that the fault lies in the 
teaching methods used so that the item that is unsatisfactory from the 
viewpoint of item analysis statistics will show satisfactory item analysis 
results for a new class that has been taught differently. In an achieve¬ 
ment test the goal should be to obtain items that are satisfactory from 
the viewpoint of both the item analysis results and the subject matter 
specialist. In order to do this, it may be necessary to revise the item, to 
revise the criterion against which the item is validated, to revise the 
methods of teaching, or the content of the course. 

Relatively little of a precise nature is now known regarding the effect 
of item selection on test skewness, kurtosis, or on the constancy of the 
error of measurement throughout the test score range. It is possible, 
however, to select items in such a way as to influence the test mean, 
variance, reliability, and validity. We shall now consider item selection 
in relation to these four test parameters for tests that are scored by 
counting the number of correct responses and are composed of items the 
responses to which are either correct or incorrect. It will also be assumed 
that the item analysis group and the prospective test group have similar 
means and variances of the ability to be tested. 

2. Item parameters related to the test mean 

Let $A_{ig}$ designate the score of the ith person on the gth item. As
shown in Table 1, $A_{ig}$ is unity if the ith person answered
the gth item correctly, and zero if the answer was incorrect. Let
N be the number of persons taking the test (i = 1 ... N), and
K be the number of items in the test (g = 1 ... K).

Since each person’s score is the number of items correctly answered,
we may write

(1) $$X_i = \sum_{g=1}^{K} A_{ig}.$$


That is, each person’s score is the sum of the entries in a row, as shown 
in Table 1. 

Table 1

                          Items (g, h = 1 ... K)
Individuals
(i, j = 1 ... N)      1      2      3      4     ...     K       Score
       1              1      1      0      0     ...     0       X_1
       2              1      0      1      1     ...     1       X_2
       3              0      1      1      1     ...     0       X_3
       4              0      0      0      1     ...     1       X_4
      ...
       N              0      1      0      0     ...     0       X_N

Sums                 d_1    d_2    d_3    d_4    ...    d_K      Σd = ΣX
Sums ÷ N             p_1    p_2    p_3    p_4    ...    p_K      Σd/N = ΣX/N


The test mean is given by

(2) $$M_x = \frac{\sum_{i=1}^{N} X_i}{N}.$$



Substituting equation 1 in equation 2 and noting that the grand total 
given by adding the row sums is the same as that given by adding the 
column sums, we may write 


(3) $$M_x = \frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{K} A_{ig} = \frac{1}{N}\sum_{g=1}^{K}\sum_{i=1}^{N} A_{ig}.$$

For any given item g the item difficulty is defined as the proportion of
correct responses. Designating the difficulty of item g by $p_g$, we have

(4) $$p_g = \frac{\sum_{i=1}^{N} A_{ig}}{N}.$$

Substituting equation 4 in equation 3, we write

(5) $$\bar{X} \text{ or } M_x = \sum_{g=1}^{K} p_g = K\bar{p},$$

where $M_x$ is the test mean, also designated $\bar{X}$,

$p_g$ is the proportion of correct responses for the gth item,

K is the number of items in the test, and

$\bar{p}$ is $(1/K)\sum_{g=1}^{K} p_g$, the average item difficulty.

If test score is taken as the number of correct answers, the test
mean is equal either to the number of items multiplied by the
average item difficulty or to the sum of the item difficulties,
when item difficulty is defined as the proportion of correct
responses.
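
Equations 1 to 5 can be traced on a small matrix of item scores. The sketch below is illustrative only: the matrix is a hypothetical four-person, five-item example, and the function name is an assumption made for this illustration.

```python
def item_statistics(A):
    """Row sums, column proportions, and test mean for a 0/1 matrix of item scores.

    A[i][g] is 1 if person i answered item g correctly and 0 otherwise.
    """
    N = len(A)          # number of persons
    K = len(A[0])       # number of items
    scores = [sum(row) for row in A]                                    # equation 1
    test_mean = sum(scores) / N                                         # equation 2
    difficulties = [sum(row[g] for row in A) / N for g in range(K)]     # equation 4
    mean_difficulty = sum(difficulties) / K                             # average item difficulty
    return {
        "scores": scores,
        "test_mean": test_mean,
        "difficulties": difficulties,
        "mean_difficulty": mean_difficulty,
    }

# Hypothetical responses: rows are persons, columns are items.
A_demo = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
]

stats = item_statistics(A_demo)
print(stats["scores"])          # row sums
print(stats["difficulties"])    # column sums divided by N
print(stats["test_mean"], sum(stats["difficulties"]))
```

The last line prints the test mean computed from the row sums next to the sum of the item difficulties computed from the column sums; equation 5 says the two must coincide.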

It should be noted that equation 5 holds only if “correct response” 
and “incorrect response” are defined in the same way for both test scoring 
and item analysis purposes. For example, if the score is “number right,” 
items answered incorrectly, items skipped, and items omitted will each 
count zero in determining total score. They must then be similarly 
counted when obtaining p g . Table 1 shows that we have assumed a 
matrix of “Ts” and “0’s.” These terms are added by rows to determine 
the score of each person, and the same terms are added by columns to 
determine item difficulty. If the test is a power test, item difficulty 
defined as the proportion of correct responses will represent a charac¬ 
teristic of the item in relation to the ability of the group. If the test is a 
speed test, p g is entirely or primarily a characteristic of the position of 
the item in the test and the timing of the total test. For a speed test, 
“proportion of correct responses” does not represent a characteristic of 
the item; hence this type of analysis is inappropriate insofar as a test is 
speeded. 

3. Item difficulty parameters that compensate for changes in 
group ability 

Several measures of item difficulty have been suggested that allow
for the possibility that the item analysis group may be different from
the prospective test group.

Thurstone's difficulty calibration method (Thurstone, 1947a), which he
has used in the construction of the American Council on Education
Psychological Examination, is the simplest and most direct method of

$z_p$ is a normal base line equivalent of p, the percentage correct
for the item (if p > .50, $z_p$ < 0; if p < .50, $z_p$ > 0). The
other terms have the same definition as in equation 7,2 being
taken as unity.

The item criterion curve has also been suggested as giving an indication 
of the ability level at which half the persons would fail and half pass the 
item. In this method the group taking the test is divided into five, 
ten, or twenty subgroups on the basis of some criterion, usually the 
total test score. These groups are taken as representing various ability 
levels. Then the percentage correct on a given item for each of these 
subgroups is computed. In general it is found that only a small
percentage of the lowest group gets the item correct and that a larger and
larger percentage of each succeeding ability group gets the item correct.
From this information we can determine by interpolation (or
extrapolation, in the case of very easy or very difficult items) the ability level at
which half the persons would answer the item correctly and half
incorrectly. This level then represents the criterion level at which half
fail and half pass the item, and is taken as indicating the item difficulty. 
If the assumptions for a biserial correlation coefficient are met, this 
method will give results identical with those obtained by equation 8, 
since its purpose and method of procedure are essentially identical. 
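
The item criterion curve procedure described above might be sketched as follows. The text does not specify the rule of interpolation; straight-line interpolation between adjacent subgroups (and straight-line extrapolation from the two nearest subgroups) is assumed here, and the subgroup levels and proportions are hypothetical.

```python
def half_pass_level(levels, prop_correct):
    """Criterion level at which half the persons would pass the item.

    levels:       mean criterion score of each subgroup, in ascending order
    prop_correct: proportion answering the item correctly in each subgroup
    """
    points = list(zip(levels, prop_correct))
    if len(points) < 2:
        raise ValueError("at least two subgroups are needed")

    def line_through(first, second):
        (x0, p0), (x1, p1) = first, second
        if p0 == p1:
            raise ValueError("item curve is flat; the .50 point is undefined")
        return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)

    # Interpolate within the first pair of adjacent subgroups that brackets .50.
    for first, second in zip(points, points[1:]):
        if min(first[1], second[1]) <= 0.5 <= max(first[1], second[1]) and first[1] != second[1]:
            return line_through(first, second)

    # No bracket: extrapolate below the lowest group (very easy item)
    # or above the highest group (very difficult item).
    if prop_correct[0] > 0.5:
        return line_through(points[0], points[1])
    return line_through(points[-2], points[-1])

# Five hypothetical ability subgroups formed from total test score.
levels_demo = [10, 20, 30, 40, 50]
typical_item = [0.10, 0.30, 0.55, 0.80, 0.95]
easy_item = [0.60, 0.70, 0.80, 0.90, 0.95]

print(half_pass_level(levels_demo, typical_item))
print(half_pass_level(levels_demo, easy_item))
```

For the easy item every subgroup is above .50, so the level is obtained by extrapolating below the lowest subgroup, as the text anticipates for very easy items.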

The four types of methods just discussed, Thurstone’s method of 
calibrating item difficulty, the normal curve transformation as repre¬ 
sented in equation 6, the transformation based on the regression line 
as shown in equation 8, or the use of the midpoint of the item curve, 
may all be regarded as attempts to find an item difficulty parameter 
that is invariant with respect to changes in the mean or dispersion of 
the ability of the group. As far as the author is aware, there is no pub¬ 
lished experimental evidence to show how well any of these methods 
succeeds in its purpose. The first and last methods are simple and direct, 
involving no assumptions such as those in equations 6 or 8. However, 
if the assumptions of biserial correlation are justified, it would seem that 
the method represented by equation 8 is best since it makes use of all 
the available data to determine the item difficulty level. 

If the total test score is to be determined by counting the number of 
items answered correctly, it does not seem particularly appropriate to 
measure item difficulty in terms of criterion level, as is done in equations
6, 8, and the item curve method. Such measures of item difficulty would
seem appropriate for a test that is to be scored in terms of “level reached”
or for a test that is constructed by the absolute scaling principles
(Thurstone, 1925 and 1927b). However, if these item difficulty measures in
terms of criterion level turn out to be relatively invariant with respect
to changes in group ability level, it should be possible to translate them
into different “percentage correct” scores corresponding to the particular
group to be tested.


4. Estimates of the percentage who know the answer to an item 

Other measures of item difficulty have been devised to estimate the 
percentage of persons in the group that “know” the answer to the item, 
as distinct from those who guess, and guess correctly. 

Guilford (1936a) has suggested that the usual method of correcting 
for chance be applied to items as well as test scores. This method 
involves two assumptions.

1. That the persons can be divided into two groups, (a) those who 
know the answer and ( b ) those who guess the answer. 

2. Those who guess are equally likely to select any one of the
alternatives given.


Let f designate the number of different answers given for an item;
then 1/f of those who “guessed” would guess correctly, and (f − 1)/f
would guess incorrectly. Since this latter group includes all who
answer incorrectly (by assumption 1 above there is no
misinformation leading to the incorrect answer), 1/(f − 1) of those who answer
incorrectly is equal to the number of lucky guessers; hence, subtracting
(Number wrong)/(f − 1) from the number right will give the number
who got the right answer not by guessing but by knowledge. The
percentage who know the answer (designated p') may be written

(9) $$p' = \frac{R_i - \dfrac{W_i}{f - 1}}{T},$$

where $R_i$ is the number of correct answers to the item,

$W_i$ is the number of incorrect answers to the item,

f is the number of possible answers given for each item,

T is the total number who tried the item [T may be considered
equal to rights plus wrongs ($R_i + W_i$) or may also include
those who skipped the item], and
p' is an estimate of the percentage knowing the answer to that
item.
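
Equation 9 can be illustrated with a short sketch. The function name and the counts are hypothetical; T defaults to rights plus wrongs, which is one of the two conventions the definition of T allows.

```python
def proportion_knowing(rights, wrongs, n_alternatives, total=None):
    """Equation 9: estimated proportion who know the answer to an item.

    rights, wrongs: numbers of correct and incorrect answers to the item
    n_alternatives: f, the number of possible answers given for the item
    total:          T; defaults to rights + wrongs, but may include skips
    """
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    if total is None:
        total = rights + wrongs
    lucky_guessers = wrongs / (n_alternatives - 1)
    return (rights - lucky_guessers) / total

# Hypothetical five-alternative item: 60 right, 40 wrong, nobody skipped.
p_prime = proportion_knowing(60, 40, 5)

# The same counts when 25 further persons skipped the item and are included in T.
p_prime_with_skips = proportion_knowing(60, 40, 5, total=125)
print(p_prime, p_prime_with_skips)
```

In the first case the 40 wrong answers imply 10 lucky guessers, leaving 50 of the 100 persons credited with knowing the answer.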


It should be noted that one implication of this method is that the 
same number of persons will select each of the incorrect alternatives, 
and that some number greater than this will select the correct alternative. 
Investigation of any multiple choice test will show that rarely, if ever, 
are all the distractors equally attractive. Horst (1933) has suggested 
an item difficulty measure for multiple choice items that assumes that
the different distractors are unequally attractive. 

Horst (1933) makes the two assumptions indicated for equation 9, 
and in addition he assumes that those who do not know the correct 
answer fall into various subgroups. The first subgroup is composed of 
those who know nothing about the alternatives in question; hence the 
members of this group are distributed equally to all of the f possible
answers. A second group is composed of those who know that one of the
alternatives is wrong, hence distributes its answers uniformly over the
remaining f − 1 choices, and so on. The next-to-the-best group knows
that all but two of the alternatives are wrong, hence distributes its
choices evenly between the correct answer and one of the incorrect
choices. The best group is composed of those who know the right
answer and those who know that each of the other choices is wrong,
hence pick the right answer by elimination. According to this
reasoning, the number of persons in this last group is equal to the number
choosing the correct alternative minus the number who mark the most 
popular incorrect alternative. 

Let us consider what would happen to a five-alternative item. Let 5a
designate the number of persons knowing nothing. Since they
distribute equally among the five alternatives, a persons will choose each
of the five alternatives. Let 4b designate the number who know that
one of the alternatives is wrong; b of them will choose each of the other
four answers. The next group is designated by 3c, c of whom will choose
the correct answer and c of whom will choose each of the two most
popular wrong answers. Assume that 2d persons know enough to avoid all
but one of the distractors, hence divide equally between it and the
correct answer. Finally we have e persons who know the right answer,
or else know that all the others are wrong; hence all these e will pick
the correct answer. Let us use $W_1$ to designate the number picking the
poorest distractor, $W_2$ for the number picking the next most popular,
and so on up to $W_{f-1}$ for the number picking the most popular distractor.
Then we may write

$$\begin{aligned}
W_1 &= a, \\
W_2 &= a + b, \\
W_3 &= a + b + c, \\
W_4 &= a + b + c + d, \\
R &= a + b + c + d + e.
\end{aligned}$$

Thus we have

$$e = R - W_4.$$
In general we see that the number of persons who know the correct
answer is equal to the number marking the correct answer minus the
largest number selecting any one of the incorrect answers. If we
designate the corresponding estimate of the percentage knowing the correct
answer by p", we have

(10) $$p'' = \frac{R - W_{f-1}}{T},$$

where R is the number of persons selecting the correct answer,

$W_{f-1}$ is the number selecting the most popular incorrect answer, and
T is the total number of persons responding to that item.


This method has the distinct advantage over equation 9 that it takes 
account of the fact that different numbers of persons will pick the different 
distractors in an item. It also furnishes a criterion for the possible 
presence of actual misinformation. According to the theory, more 
persons will select the correct alternative than will select any of the 
incorrect alternatives. This follows from the assumption 
that any subgroup with a given amount of information will distribute 
equally among the alternatives they do not know to be false. In amplifying 
this theory, allowance should be made for chance variations from 
such a distribution. We may say, however, that, if a considerably 
greater number of persons select one distractor than select the correct 
answer, it is likely that some actual misinformation exists in the group, 
and the method indicated in equation 10 does not apply. A method 
of test scoring appropriate for the measures of item difficulty shown in 
equations 9 and 10 has not been suggested. 
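
As a rough illustration of equation 10, the sketch below computes p'' as the count on the correct answer minus the count on the most popular distractor, divided by the total. The function name and the counts are hypothetical, and the misinformation flag is deliberately crude: it fires whenever any distractor outdraws the correct answer, whereas the text asks for a "considerably greater" number and an allowance for chance.

```python
def estimate_proportion_knowing(right, wrong_counts):
    """Equation 10: p'' = (R - W_{f-1}) / T.

    right        -- R, number choosing the correct answer
    wrong_counts -- number choosing each distractor (any order)
    Assumes every respondent marked exactly one alternative, so T is the sum.
    """
    total = right + sum(wrong_counts)
    most_popular_wrong = max(wrong_counts)
    p_knowing = (right - most_popular_wrong) / total
    # Crude flag only; no allowance for chance variation is made here.
    misinformation_suspected = most_popular_wrong > right
    return p_knowing, misinformation_suspected


# Hypothetical counts for a well-behaved item and for a suspicious one.
p_knowing, flagged = estimate_proportion_knowing(175, [10, 20, 40, 55])
p_bad, flagged_bad = estimate_proportion_knowing(60, [10, 20, 40, 170])
```

When the flag is raised the estimate is negative and, as the text notes, the method should not be applied to that item.
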


5. Item difficulty parameters—general considerations 

Innumerable other measures of item difficulty have been suggested 
that are based on the percentage correct for the upper and lower k 
per cent of the population; see Cook (1932), Lentz, Hirshstein, and 
Finch (1932), Guilford (1936b), Kelley (1939), and Davis (1946). The 
upper and lower k per cent are chosen on the basis of total test score, 
and k has been given various values such as 10, 20, 25, 27, 33. Such 
difficulty measures are usually incidental to methods for obtaining a 
rapid approximation to the correlation between item and test score. 
Insofar as they are measures of item difficulty, they are regarded as 
approximations to the basic statistic of percentage of persons answering 
correctly. In general, the proper method of evaluating a statistic that 
is an economical approximation to some other statistic is 

1. To determine the standard error or confidence interval for each of 
the statistics. 
2. To determine how many cases must be used for each method in 
order to give statistics of equal precision.¹ 

3. To determine the dollar cost of each method for the number of 
cases indicated in step 2. 

Thus the expense of obtaining statistics of equal precision is determined, 
and the cheaper method may then be advocated. As far as the 
author is aware, none of the statistics indicating item difficulty or item-total 
correlation have been subjected to such theoretical and experimental 
comparisons. Thus we do not have the only type of information 
that is relevant for judging the relative merits of the different short-cut 
methods. 

In summary, then, we may say that the methods of item analysis 
should be considered as a part of the total test theory problem. The 
theoretical relation between the item parameters and test parameters 
should be shown. In the test theory presented here the number correct 
is the score, and, since the mean test score is the sum of the proportion 
of correct responses for each item, there is a very simple relationship 
between item difficulty and test mean provided item difficulty is measured 
as the proportion of correct responses. 
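
The relation between item difficulty and test mean is easy to confirm on a small table of scores. The data below are hypothetical, and the sketch assumes number-correct scoring with every item scored 0 or 1.

```python
# Hypothetical 0/1 item scores: scores[i][g] is 1 if person i passed item g.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
n_persons = len(scores)

# Item difficulty p_g: proportion of persons answering item g correctly.
p = [sum(column) / n_persons for column in zip(*scores)]

# Mean number-correct score over persons.
mean_score = sum(sum(row) for row in scores) / n_persons

# The test mean equals the sum of the item proportions.
print(mean_score, sum(p))
```
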

The only other difficulty measure that is consistently related to a 
method of test scoring is the median ability level for the item. This 
measure of item difficulty is appropriate for tests set up and scored by 
methods of absolute scaling. 

The other measures of item difficulty have been set up to cope with 
special problems, such as change in ability level of the group, the problem 
of guessing, or the problem of inadequate clerical help, necessitating 
abbreviated methods. Theoretical and experimental information adequate 
for evaluating these methods is not yet available. 

There have been several empirical studies that show that tests composed 
of items answered correctly by about 50 per cent of the group 
have a higher validity than tests composed of items that are easier or 
harder than 50 per cent, but otherwise of the same type. See, for example, 
Cook (1932), T. G. Thurstone (1932), and Richardson (1936a). 
In section 8 of this chapter, an equation showing the relationship between 
item parameters and test validity is developed (equation 24). 
This equation does not show any direct relationship between test validity 
and item difficulty. Test validity, however, does depend on the point-biserial 
item-criterion correlation. This correlation may increase 
rapidly as items approach a 50 per cent difficulty level; see Carroll 
(1945) and Gulliksen (1945). Hence it is suggested that the higher 
validity found for tests composed of items with 50 per cent difficulty 
may be due to and directly measured by the increase in item-criterion 
correlation. 

¹ The paper by Mosteller (1946) illustrates a good theoretical comparison of several 
different methods of estimating a parameter. 

6. Item parameters related to test variance 

Another item analysis problem is selecting items in order to control 
the standard deviation of the total test score (s_x). We may, for example, 
wish to select a subset of k items out of a total of K items in such a way 
as to have a k-item test with the largest possible standard deviation, 
the smallest, or so that its standard deviation will equal as closely as 
possible that of another test. 

Equation 9 of Chapter 7 gives the variance of a composite as the sum 
of all the terms in the variance-covariance matrix. If the complete 
variance-covariance matrix were available for a set of items, it would 
be possible to add the variances and covariances for different possible 
subsets of items and to find the variance of total test score for each 
possible subset of items. For any large number of items, however, the 
amount of labor required to do this is very great. The procedure 
usually seems impractical with present computational facilities. 
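
For a handful of items the matrix computation just described is short. The sketch below uses hypothetical data and helper names, with population covariances (divisor N). It sums the entries of the interitem variance-covariance matrix belonging to a chosen subset and compares the result with the variance of the subtest scores computed directly.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance (divisor N)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def covariance_matrix(scores):
    """Interitem variance-covariance matrix from rows of 0/1 item scores."""
    columns = list(zip(*scores))
    return [[covariance(g, h) for h in columns] for g in columns]


def subset_variance(cov, items):
    """Variance of the score on `items`: sum of all entries of the sub-matrix."""
    return sum(cov[g][h] for g in items for h in items)


# Hypothetical 0/1 scores for six persons on four items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
cov = covariance_matrix(scores)

subset = (0, 1, 2)
from_matrix = subset_variance(cov, subset)

# Direct check: variance of the scores on the three-item subtest.
subtotals = [sum(row[g] for g in subset) for row in scores]
direct = covariance(subtotals, subtotals)
```

The labor grows with the number of possible subsets rather than with the cost of any one sum, which is why the text turns to item-test correlations for large item pools.
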

We can obtain a reasonably useful result by working with the correlation 
between the item and total test score. From equations 3 to 7, 
Chapter 7, we learn that, if a composite gross score is formed by adding 
gross scores of parts, the deviation score for the composite is the sum 
of deviation scores for the parts; hence from equations 1 and 5 we have 

(11)    x_i = X_i - \bar{X} = \sum_{g=1}^{K} (A_{ig} - p_g) = \sum_{g=1}^{K} a_{ig}, 

where x_i designates the deviation score for the test, and 
a_{ig} designates the deviation score for the item. 

Designating the standard deviation of item g by s_g, that of the total 
test by s_x, and the item-test correlation by r_{xg}, we may write 

(12)    N r_{xg} s_g s_x = \sum_{i=1}^{N} x_i a_{ig}. 

Substituting equation 11 in equation 12 and reversing the order of 
summation gives 

(13)    N r_{xg} s_g s_x = \sum_{h=1}^{K} \sum_{i=1}^{N} a_{ih} a_{ig}. 

Note that it is necessary to use two different subscripts (h and g) to 
indicate that, for a given item g, we take the cross products with all 
items (h = 1 to K, including g). Since the terms of the form \sum_{i} a_{ih} a_{ig} / N 
indicate an interitem covariance, we may divide both sides by N and 
write 

(14)    r_{xg} s_g s_x = \sum_{h=1}^{K} r_{gh} s_g s_h    (r_{gg} = 1). 

Since g is a number from 1 to K, and h varies from 1 to K, there will 
be one term in the summation where h = g. This term will be a variance, 
and the other K - 1 terms will be covariances. To indicate this explicitly, 
we write 

(15)    r_{xg} s_g s_x = s_g^2 + \sum_{h=1}^{K} r_{gh} s_g s_h    (h \neq g), 

where s_g and s_h are item standard deviations, which may be written 
\sqrt{p(1 - p)}, 
r_{gh} is the fourfold point correlation of items g and h, 
r_{xg} is the point-biserial correlation of item g with the total 
test composite x, and 
s_x is the standard deviation of total test score. 

In other words, the sum of the terms in any one column (or row) of 
the interitem variance-covariance matrix is the covariance between 
that item and the total test score. By using the gross score formula for 
variance and covariance, these results may be expressed in terms of the 
proportion answering an item correctly and the proportion answering 
both items of a pair correctly. From equation 11 and the definition of 
covariance we have 

N r_{gh} s_g s_h = \sum_{i=1}^{N} (A_{ig} - p_g)(A_{ih} - p_h) = \sum_{i=1}^{N} A_{ig} A_{ih} - N p_g p_h. 

Since the product A_{ig} A_{ih} is zero if either factor is zero and is unity if 
both factors are unity, the summed products are equal to the number 
of persons answering both items g and h correctly. This may be verified 
with the help of the illustrative table of scores (Table 1). Dividing by 
N, we have the proportion of persons answering both g and h correctly, 
which will be designated p_{gh}. Thus the interitem covariance is 

(16)    r_{gh} s_g s_h = p_{gh} - p_g p_h. 

For the variance of an item, we have the special case of equation 16 
in which h = g. In this case p_{gh} becomes p_{gg}, which is identical with 
p_g; thus we have 

(17)    s_g^2 = p_g - p_g^2 = p_g(1 - p_g). 
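
The interitem covariance as the proportion passing both items minus the product of the two item proportions, with the item variance p_g(1 - p_g) as the special case h = g, can be checked by counting. The data and function names in this sketch are hypothetical.

```python
def proportion_both_correct(scores, g, h):
    """p_gh: proportion of persons answering both item g and item h correctly."""
    return sum(row[g] * row[h] for row in scores) / len(scores)


def interitem_covariance(scores, g, h):
    """Equation 16 form: p_gh - p_g * p_h. With h == g this is p_g * (1 - p_g)."""
    n = len(scores)
    p_g = sum(row[g] for row in scores) / n
    p_h = sum(row[h] for row in scores) / n
    return proportion_both_correct(scores, g, h) - p_g * p_h


# Hypothetical 0/1 scores for six persons on four items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
cov_01 = interitem_covariance(scores, 0, 1)   # items 0 and 1 tend to be passed together
cov_13 = interitem_covariance(scores, 1, 3)   # nobody passes both items 1 and 3
var_0 = interitem_covariance(scores, 0, 0)    # item variance as the special case h = g
```
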
The item-test covariance shown in equations 14 and 15 may, by the use 
of equations 5 and 16, be written 

(18)    r_{xg} s_g s_x = \sum_{h=1}^{K} p_{gh} - M_x p_g    (p_{gg} = p_g), 

where p_g is the proportion of persons answering item g correctly, 
p_{gh} is the proportion of persons answering both item g and item h 
correctly, 
M_x is the mean of the total test score, and the other terms have the 
definitions indicated for equation 15. 

Substituting equation 15 (Chapter 21) in equation 9 (Chapter 7) and 
designating the test variance by s_x^2, we have 

(19)    s_x^2 = \sum_{g=1}^{K} r_{xg} s_g s_x. 

The sum of the item-test covariances is equal to the sum of the 
terms in the interitem variance-covariance matrix, which is 
equal to the test variance. Thus the test variance is expressed 
in terms of item parameters. 

Since s_x is a constant when summing over g, the right-hand side of 
equation 19 may be written s_x \sum r_{xg} s_g. Dividing both sides by s_x gives 

(20)    s_x = \sum_{g=1}^{K} r_{xg} s_g = K \overline{r_{xg} s_g}. 

Define the product r_{xg} s_g as the “reliability index” for item g. 

Then the standard deviation of the total test score (designated 
s_x) is equal to the sum of the item reliability indices. 

It should be noted particularly that no approximations were used in 
deriving equations 19 and 20. The only possible reason for either of 
these equations failing to work in any particular case is the occurrence 
of an arithmetical error in the calculations. It should also be noted 
that, in terms of the derivation, r_{xg} must be a point biserial correlation. 
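
That the item reliability indices sum exactly to the standard deviation of the total score can be confirmed on any complete set of item scores. The sketch below is illustrative, with hypothetical data and helper names. It computes each point-biserial correlation as an ordinary product-moment correlation with population (divisor N) moments and assumes no item has zero variance.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance (divisor N)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def reliability_indices(scores):
    """Return ([r_xg * s_g for each item], s_x), with x the total on all items."""
    totals = [sum(row) for row in scores]
    s_x = math.sqrt(covariance(totals, totals))
    indices = []
    for item in zip(*scores):
        s_g = math.sqrt(covariance(item, item))
        r_xg = covariance(item, totals) / (s_g * s_x)   # point-biserial correlation
        indices.append(r_xg * s_g)
    return indices, s_x


# Hypothetical 0/1 scores; the last item runs against the other three.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
indices, s_x = reliability_indices(scores)
print(sum(indices), s_x)   # the two agree up to rounding
```
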

Unfortunately, however, these equations hold exactly only for the 
standard deviation of the total test. For a subtest made up of a subset 
of items, the sum of the item reliability indices based on correlation of 
item with total test score will not exactly equal the standard deviation 
of the subtest. For example, if the interitem correlations are nearly 
equal and all positive, the sum of the reliability indices for half the items 
in the test (\sum_{g=1}^{K/2} r_{xg} s_g) will give a value larger than the standard deviation 
of the test composed of half the items because for approximately parallel 
items the correlation of an item with a longer test will be greater than 
its correlation with a shorter test. This may be seen from equation 5, 
Chapter 9. 

However, a test composed of items with large reliability indices will 
probably have a greater standard deviation than one composed of items 
with small reliability indices. Also if the items in two tests are matched 
simultaneously with respect to two item parameters such as r_{xg} and p_g 
(since s_g is a function of p_g, see equation 17), the two tests will have 
closely comparable means and standard deviations. Refer to the 
method sketched in Chapter 15, section 7, and Figure 3 of Chapter 15. 
We shall see in the next section that the reliability of a test is determined 
by the item variances and interitem covariances, together with the 
number of items, so that matching two tests item for item with respect 
to both r_{xg} and p_g would give tests with similar reliabilities as well as 
similar variances. 

7. Item parameters determining test reliability 

The equation showing the relation between number of items, item 
variance, item reliability index, and test reliability may be written by 
substituting equation 20 (Chapter 21) in equation 10 (Chapter 16). 
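This gives 

(21)    r_{xx} = \frac{K}{K - 1} \left[ 1 - \frac{\sum_{g=1}^{K} s_g^2}{\left( \sum_{g=1}^{K} r_{xg} s_g \right)^2} \right], 

where K is the number of items in the test, 
s_g^2 is the item variance, which equals p_g - p_g^2, 
r_{xg} s_g is the item reliability index, and 
r_{xx} is the reliability of the total test. 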

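If we write a sum of terms as K times the average, and divide numerator 
and denominator by K, we have 

(22)    r_{xx} = \frac{K}{K - 1} \left[ 1 - \frac{\overline{s_g^2}}{K \left( \overline{r_{xg} s_g} \right)^2} \right], 

where \overline{s_g^2} is the average item variance, and 
\overline{r_{xg} s_g} is the average item reliability index, and the other terms have 
the same definitions as in equation 21. 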
The item variance (s_g^2 or p_g - p_g^2) approaches zero as p_g approaches 
zero or unity, and reaches a maximum value of .25 when p_g = .5. Since the 
values of s_g^2 vary between zero and .25, the average item variance 
must also be between these limits. The value of s_g varies between 0 and 
.5, and the value of r_{xg} between 0 and 1. Thus the reliability index 
must lie between 0 and .5. That is, the average item variance and the 
average reliability index vary within narrow limits; hence these factors 
cannot have much influence on the test reliability unless r_{xg} is near zero, 
in which case the denominator will become small and the reliability will 
be low. On the other hand, K, the number of items, increases uniformly 
with the addition of new items. As can be seen from equation 22, the 
effect of this change in K is to move the reliability nearer to unity. The 
number of items is of itself an important determiner of reliability. As 
long as we avoid items that have a very low or negative correlation with 
total test score, the addition of items with low positive correlations will 
usually increase the reliability of the total test. 

Equations 21 and 22 give the test reliability as functions of 
the item reliability index (r_{xg} s_g), the item variance (s_g^2), and 
the number of items (K). 
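
A numerical illustration may help fix the roles of the three quantities. The sketch below assumes that equation 21 has the Kuder-Richardson form, K/(K - 1) times one minus the ratio of the summed item variances to the squared sum of the reliability indices. That form, the data, and the function names are assumptions of this example.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance (divisor N)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def reliability_from_items(scores):
    """r_xx = K/(K-1) * (1 - sum(s_g^2) / (sum(r_xg * s_g))^2)  (assumed form)."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    s_x = math.sqrt(covariance(totals, totals))
    item_variances = [covariance(item, item) for item in zip(*scores)]
    # r_xg * s_g equals cov(g, x) / s_x, since r_xg = cov(g, x) / (s_g * s_x).
    indices = [covariance(item, totals) / s_x for item in zip(*scores)]
    return k / (k - 1) * (1 - sum(item_variances) / sum(indices) ** 2)


# Hypothetical 0/1 scores for six persons on three items.
scores = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
r_xx = reliability_from_items(scores)   # 16/19 for these data
```
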

If the number of items composing the test is fixed, the reliability of 
the test can be increased only by making the average item variance 
smaller or the average item reliability index larger. To make such a 
selection of items graphically, each item is represented by a point, the 
ordinate of which is the item variance (s_g^2) and the abscissa of which 
is the reliability index (r_{xg} s_g). In order to maximize the test reliability, 
we must select a subset of points such that the average ordinate is as 
small as possible and the average abscissa is as large as possible. This 
means that the points must be selected from the lower right-hand portion 
of the graph. 

It should be noted that equations 21 and 22 are strictly accurate if 
all the points, that is, all the items in the test, are used. If we consider 
a subset of items that is only a half or a third of the original number of 
items, it is likely that the values of r_{xg} for the total test will be different 
from the values of r_{xg} for the subtest. Thus using equation 21 and the 
values of r_{xg} for the total test will give an over- or an underestimate of 
the reliability of the subtest. However, tests that are matched item for 
item on the basis of both item variance (s_g^2) and item reliability index 
(r_{xg} s_g) will probably have closely similar reliabilities. A subset of a 
given number of items selected for large reliability index and small item 
variance will have a higher reliability than a test composed of the same 
number of items that have a small reliability index and a large item 
variance. 

Note that, if we desire to select a subset of k items from a total group 
of K items, a completely accurate solution is obtained by using the interitem 
variance-covariance matrix, computing the sum of the diagonal 
elements (\sum s_g^2) and the sum of all the elements for various subsets of 
items, and selecting the one subset of size k that has the highest reliability. 
However, with current methods of computation, this method is 
considered too laborious to be of practical use. The approximation 
indicated by the use of equation 21 is, however, computationally 
feasible and reasonably accurate if the purpose is to eliminate the 
poorest 10 or 20 per cent of the items. 
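
For a very small item pool the completely accurate solution can be written out directly. The sketch below is an illustration, not a recommended procedure: it tries every subset of size k, so the work grows combinatorially with K. It assumes the reliability formula has the Kuder-Richardson form, k/(k - 1) times one minus the ratio of the diagonal sum to the sum of all elements. Data and names are hypothetical, and subsets whose score variance is zero are not handled.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance (divisor N)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def exact_reliability(scores, items):
    """Reliability of the subtest on `items` from its own variances and covariances."""
    columns = [[row[g] for row in scores] for g in items]
    k = len(columns)
    diagonal = sum(covariance(c, c) for c in columns)               # sum of s_g^2
    whole = sum(covariance(c, d) for c in columns for d in columns)  # subtest variance
    return k / (k - 1) * (1 - diagonal / whole)


def best_subset(scores, k):
    """Exhaustive search over all k-item subsets; feasible only for small K."""
    n_items = len(scores[0])
    return max(combinations(range(n_items), k),
               key=lambda items: exact_reliability(scores, items))


# Hypothetical 0/1 scores; the last item runs against the other three.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
chosen = best_subset(scores, 3)
```
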

Numerous arbitrary indices of the relationship between item and test 
score have been developed. Adkins (1938) has pointed out that these 
indices may be classified as approximations to some one of three 
statistics: 

1. The item-test correlation. 

2. The slope of the regression of test on item. 

3. The slope of the regression of item on test. 

The first type would be illustrated by the use of various correlation 
coefficients, such as the biserial, the point biserial, or the tetrachoric; 
the second by the use of indices that depend on the mean difference in 
test score between those passing and failing the item; and the third 
by indices dependent on the slope of the item curve (see Ferguson, 1942, 
Finney, 1944, or Turnbull, 1946). Some of the suggested indices are 
attempts to decrease the clerical and machine costs of item analysis 
by using only a part of the data; see, for example, Kelley (1939), Flanagan 
(1939a), and Davis (1946). 

8. Item parameters determining test validity 

Having considered item selection in relation to test mean, variance, 
and reliability, we turn now to the problem of selecting items to maximize 
the validity of the total test score. It is not possible to do this directly 
unless we have information regarding the correlation of each item with 
the appropriate criterion score. In most practical cases it is probable 
that selecting items to increase the reliability of a test will also incidentally 
increase test validity. Equation 5, Chapter 9, shows that 
increasing test length increases the validity of the test. Increasing test 
length is also an effective means of increasing test reliability as shown 
in equation 22 (of this chapter) and equation 10 (Chapter 8). However, 
special cases have been demonstrated where it is possible to decrease 
validity while increasing reliability or to increase validity at the expense 
of reliability; see, for example, Cook (1932), Tucker (1946), or Brogden 
(1946b). In other words, if no criterion is available it is highly desirable 
to take steps to increase test reliability; however, in laying a theoretical 
foundation for improving test validity it is essential to consider the 
correlation of each item with a criterion. 

Theoretically the problem of maximizing test validity for predicting 
any specified criterion has been solved. We have only to obtain the 
complete interitem variance-covariance matrix and the item-criterion 
covariances, and then solve all multiple correlations or all multiple 
correlations using a specified number of items (equation 55, Chapter 20). 
Frisch (1934) has described the method for dealing with “complete 
regression systems.” Such methods, however, are generally regarded 
as too laborious for present computational procedures. Several approximation 
techniques have been devised as indicated in Chapter 20, section 3. 
All these methods have in common the assumption that the 
best single test (or item) is included in the best two; that the best two 
will be included in the best three, and so on. By such methods we work 
only K - 1 multiple correlations for K items, which is laborious but 
feasible. Such procedures have been described by Horst (1934b), Edgerton 
and Kolbe (1936), Adkins and Toops (1937), Wherry (1940), Toops 
(1941), and Jenkins (1946). However, it would seem that most test 
workers still consider the labor of these methods prohibitive, since they 
have not attained very wide use. It is possible by using additional 
assumptions to develop a less laborious method that makes use of only 
2K item parameters, namely, a reliability index and a validity index 
for each of the K items of the original experimental test. 

The general formula for the correlation of a criterion with a composite 
is given in equation 1, Chapter 9. Here we will use the subscript y to 
designate the criterion instead of I as in equation 1, Chapter 9. The 
formula for the variance of a sum is given in equation 9, Chapter 7. 
Here we shall use the subscript x to designate the total test, instead of c, 
as in equation 9, Chapter 7. If we change subscript c in equation 9, 
Chapter 7, to x, change subscript I of equation 1, Chapter 9, to y, and 
substitute equation 9, Chapter 7, in equation 1, Chapter 9, we have 

(23)    r_{xy} = \frac{\sum_{g=1}^{K} r_{yg} s_g s_y}{s_y s_x}. 

Since s_y is the same for all the terms in the summation, it may be 
factored out. If we divide numerator and denominator by s_y, and 
substitute equation 20 in equation 23, we have 

(24)    r_{xy} = \frac{\sum_{g=1}^{K} r_{yg} s_g}{\sum_{g=1}^{K} r_{xg} s_g}. 

If we substitute K times the mean for the sums and divide numerator 
and denominator by K, we have 

(25)    r_{xy} = \frac{\overline{r_{yg} s_g}}{\overline{r_{xg} s_g}}, 

where the bar over a term indicates the average, 
r_{xg} is the point biserial correlation of item g with the test x, 
r_{yg} is the point biserial correlation of item g with the criterion y, 
s_g is \sqrt{p(1 - p)}, the standard deviation of item g, 
r_{xy} is the correlation between the criterion and test, and 
K is the number of items in the test. 

If r_{yg} s_g is defined as the “validity index” of item g and r_{xg} s_g 
as the “reliability index” of item g, the test validity is the 
ratio of the sum of the validity indices to the sum of the reliability 
indices, or the ratio of the average validity index to 
the average reliability index. 
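
The statement that test validity is the ratio of the summed validity indices to the summed reliability indices can be verified on a complete set of items. In the sketch below the item scores, the criterion scores, and the helper names are all hypothetical. Correlations are ordinary product-moment correlations with population moments, and no item is assumed to have zero variance.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance (divisor N)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def pearson(u, v):
    return covariance(u, v) / math.sqrt(covariance(u, u) * covariance(v, v))


def validity_from_items(scores, criterion):
    """Ratio of the sum of validity indices to the sum of reliability indices."""
    totals = [sum(row) for row in scores]
    validity_sum = 0.0
    reliability_sum = 0.0
    for item in zip(*scores):
        s_g = math.sqrt(covariance(item, item))
        validity_sum += pearson(item, criterion) * s_g     # r_yg * s_g
        reliability_sum += pearson(item, totals) * s_g     # r_xg * s_g
    return validity_sum / reliability_sum


# Hypothetical 0/1 item scores and criterion scores for six persons.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
criterion = [9, 7, 8, 4, 5, 2]

from_indices = validity_from_items(scores, criterion)
direct = pearson([sum(row) for row in scores], criterion)
```
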

As a practical item selection procedure it is desirable to plot the item 
analysis results. For example, the reliability index may be plotted as 
the abscissa and the validity index as the ordinate (Figure 1); then the 
items should be selected as far as possible from the upper left-hand corner 
of the plot. This method was described and illustrated by Gulliksen 
(1944) and (1949a); see Figure 2. 

This method of selecting items to give a valid test is similar to the 
one suggested by Horst (1936b). It is of particular interest to note that 
the number of items in the test has, of itself, no effect on validity. 
However, an increase in number of items will, except under unusual 
circumstances, increase the reliability of the test. If no validity index 
is available, increasing the number of items in a test may well contribute 
to lowering the test validity. 

As mentioned in the introduction to this chapter, it should be noted 
that the methods presented here do not consider sampling errors nor 
the possibility of systematic variation in the item parameters. A subtest 
composed of only a few of the most valid items is probably less likely to 
maintain its high validity on a new sample of persons than a test composed 
of a large number of items. The chance and systematic fluctuations 
of the various item analysis parameters need to be studied and 
compared for various item analysis methods. 

Figure 1. Illustrating plot of validity index and reliability index for item selection. 
(From Gulliksen, 1949a.) 

In using equations 24 or 25 for item selection, we should note that the 
validity index for an item is independent of the effects of item selection. 
On the other hand, the reliability index will change as the items composing 
the test are changed. This effect need cause no concern if only a 
few of the poorest items are eliminated from the test. However, if we 
wish, for example, to select a test of 100 items from an initial test of 
500 items, it is well to make the selection in two or more stages, as suggested 
by Horst (1936b). If all the item-test correlations are positive 
and high, the selection is not so likely to change the reliability index as 
if there were quite a few items with negative reliability indices that were 
to be eliminated. In such a case the reliability indices should be recalculated 
after the first elimination of items with low and negative reliability 
indices. 
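
Selection in stages, with the reliability indices recalculated after each elimination, might be sketched as follows. The cutoff, the data, and the function names are hypothetical. The sketch drops every item whose index is at or below the cutoff, recomputes the indices against the shortened test, and stops when nothing further is dropped.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance (divisor N)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def reliability_indices(scores, items):
    """r_xg * s_g for each retained item, with x the total on `items` only."""
    totals = [sum(row[g] for g in items) for row in scores]
    s_x = math.sqrt(covariance(totals, totals))
    # r_xg * s_g equals cov(g, x) / s_x.
    return {g: covariance([row[g] for row in scores], totals) / s_x for g in items}


def staged_selection(scores, cutoff=0.0):
    """Eliminate items at or below `cutoff`, recompute, and repeat until stable."""
    items = list(range(len(scores[0])))
    while True:
        indices = reliability_indices(scores, items)
        kept = [g for g in items if indices[g] > cutoff]
        if not kept or kept == items:
            return kept
        items = kept


# Hypothetical 0/1 scores; the last item has a negative reliability index.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
retained = staged_selection(scores)
```
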
As mentioned in conjunction with item selection to control test 
variance and test reliability, it should be noted that, if we consider the 
entire test, the ratio of the average validity index to the average reliability 
index must equal the test validity. No approximation is involved. 
However, as we make more and more stringent selection of test items, 
correlation of the item with the new subtest is increasingly likely to be 
different from the correlation of the item with the original total test. 
The item selection introduces no error whatever into the numerator 
term of equation 24. The error made in estimating the validity coefficient 
is due solely to the fact that the correlation of item with total test 
will vary as the test length changes. Hence, as mentioned before, if a 
computationally feasible method of utilizing the interitem variance-covariance 
matrix were developed, it would be possible to select any 
subset k from a larger set of K items and to determine precisely the 
variance, reliability, and validity of that subset of items. 

Figure 2. Illustrating item selection to maximize test validity. (From Gulliksen, 
1944; OSRD Report 3187; Applied Psychology Panel, NDRC.) 

9. Computing formulas for item parameters 

From the theory given in the preceding sections, the essential item 
statistics are: 

1. p_g, the proportion of persons answering each item correctly. This 
quantity is a measure of item difficulty. From it, the item variance 
s_g^2 = p_g(1 - p_g) can readily be computed. 

2. r_{xg} s_g, the reliability index, which is the point-biserial correlation 
between item and total score multiplied by the item standard 
deviation. 

3. r_{yg} s_g, the validity index, which is the point-biserial correlation 
between item and criterion score multiplied by the item standard 
deviation. 

Having determined the item parameters that are related to test 
mean, variance, reliability, and validity, we turn to the problem of 
computing these values. 

We shall not consider here short-cut methods of estimating these 
parameters from a portion of the data. The principal purpose of these 
methods is to avoid the clerical labor involved in dealing with all the 
data; hence they can be compared only on the basis of computing costs 
and statistical precision. As yet such comparisons have not been made. 
For a description of such methods, see Kelley (1939), Flanagan (1939a), 
and Davis (1946). 

The item difficulty measure requires simply a count of the number 
of correct answers to each item. This count may be made manually or, 
if punched-card equipment is available, the count may be made with 
the counting sorter or the tabulator. Usually the count is obtained 
incidentally in connection with the computation of the point-biserial 
correlation or the reliability index. 

When some of the persons taking a test fail to answer certain items, 
we have the problem of how to treat such responses. As indicated in 
Chapter 17, if we are dealing with a speed test all the items must be 
easy so that the only purpose of an item analysis is to eliminate items 
with a significant proportion of errors. In a power test, the number of 
items left blank, either skipped or unattempted, should be negligible. 
An adequate theoretical analysis of a test that is a mixture of speed and 
power has not yet been presented. Such an analysis probably requires 
some information or assumptions about the correlation between speed 
and power. 
The analysis given here applies strictly only to a power test that has 
an ample time limit so that practically all the items are attempted. 
The discussion of Chapter 17 indicates objective criteria for the possible 
influence of number of blank items. For analyzing a power test with a 
large number of unattempted items, some of the methods given in section 3 
of this chapter should probably be used. 

The derivation showing the relationship of item parameters to test 
reliability or validity used the Pearson product-moment correlation of 
the item with the test or criterion score. The raw score for an item is 
unity if the item is answered correctly, and zero if the item is answered 
incorrectly. Let us begin with the formula for correlation in terms of 
summations and make the simplifications appropriate for this particular 
case. Since the formula for the item-criterion correlation is identical 
with that for the item-test correlation, we shall consider in detail only the 
correlation between a dichotomously scored item and the test score X. 

(26)    r_{xg} = \frac{N \sum_{i=1}^{N} A_{ig} X_i - \sum_{i=1}^{N} A_{ig} \sum_{i=1}^{N} X_i}{\sqrt{N \sum_{i=1}^{N} A_{ig}^2 - \left( \sum_{i=1}^{N} A_{ig} \right)^2} \sqrt{N \sum_{i=1}^{N} X_i^2 - \left( \sum_{i=1}^{N} X_i \right)^2}}. 

The sum \sum_{i=1}^{N} A_{ig} X_i may be simplified by noting that A is either unity or zero; 
hence the sum of products is equal to the sum of the test scores for those 
who answer the item correctly. This sum may be designated as \sum_{i=1}^{N_g} X_{ig}. 

Let us define N_g as the number of persons answering item g correctly 
and \bar{X}_g as the average test score for those who answer item g correctly. 
From these definitions and equation 4 we may write 

(27)    N_g = \sum_{i=1}^{N} A_{ig} = N p_g 

and 

(28)    \sum_{i=1}^{N} A_{ig} X_i = \sum_{i=1}^{N_g} X_{ig} = N_g \bar{X}_g = N p_g \bar{X}_g. 

From the definition of a standard deviation, 

(29)    N s_g = \sqrt{N \sum_{i=1}^{N} A_{ig}^2 - \left( \sum_{i=1}^{N} A_{ig} \right)^2}. 
Substituting equations 27, 28, and 29 in equation 26 and multiplying 
both sides by s_g gives the item reliability index in terms of gross score 
summations as 

(30)    r_{xg} s_g = \frac{N \sum_{i=1}^{N_g} X_{ig} - N_g \sum_{i=1}^{N} X_i}{N \sqrt{N \sum_{i=1}^{N} X_i^2 - \left( \sum_{i=1}^{N} X_i \right)^2}}. 

The reliability index may also be written in terms of means, a proportion, 
and a standard deviation. From the definitions of a mean (\bar{X}) 
and a standard deviation (s_x) we have 

(31)    \sum_{i=1}^{N} X_i = N \bar{X}    and    N s_x = \sqrt{N \sum_{i=1}^{N} X_i^2 - \left( \sum_{i=1}^{N} X_i \right)^2}. 

Substituting equations 28 and 31 in equation 30, dividing numerator 
and denominator by N^2, and factoring out p_g, we have 

(32)    r_{xg} s_g = p_g \left( \frac{\bar{X}_g - \bar{X}}{s_x} \right). 

If Y is used to designate gross scores on the criterion measure, by 
substituting Y for X in equations 30 and 32, we have the corresponding 
formulas for the item validity index. 

(33)    r_{yg} s_g = \frac{N \sum_{i=1}^{N_g} Y_{ig} - N_g \sum_{i=1}^{N} Y_i}{N \sqrt{N \sum_{i=1}^{N} Y_i^2 - \left( \sum_{i=1}^{N} Y_i \right)^2}} 

and 

(34)    r_{yg} s_g = p_g \left( \frac{\bar{Y}_g - \bar{Y}}{s_y} \right). 

In equations 30, 32, 33, and 34: 

N is the total number of persons taking the test, 
N_g is the number of persons answering item g correctly 
(g = 1 ... K), 
p_g is N_g/N, the proportion of persons answering item g 
correctly, 
X_i and Y_i designate, respectively, the test and criterion score for 
individual i (i = 1 ... N), 
\bar{X} and s_x are, respectively, the mean and the standard deviation of 
the total distribution of test scores, 
\bar{Y} and s_y are, respectively, the mean and the standard deviation of 
the distribution of criterion scores, 
X_{ig} and Y_{ig} designate, respectively, the test and the criterion score 
for each person who answers item g correctly, 
\bar{X}_g and \bar{Y}_g are the average test and criterion scores, respectively, for 
those answering item g correctly, 
r_{xg} s_g is the reliability index for item g, and 
r_{yg} s_g is the validity index for item g. 

The reliability and validity indices for each item can be computed if
we have the mean and the standard deviation of all $N$ persons for both
the test and the criterion; $p_g$, the proportion of persons answering each
item correctly; and the average test and average criterion score for those
answering each item correctly.
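As a rough illustration of the two computing forms, the sketch below evaluates the index once from means, a proportion, and a standard deviation (equation 32) and once from gross score sums (equation 30). The function names and the eight scores are hypothetical, and the standard deviation is taken with $N$ in the denominator, which is what equation 29 implies.

```python
from statistics import fmean, pstdev

def item_index_from_means(scores, item_correct):
    """Equation 32 (or 34): p_g * (mean of passers - overall mean) / s."""
    n = len(scores)
    passing = [s for s, a in zip(scores, item_correct) if a == 1]
    p_g = len(passing) / n
    # pstdev divides by N, matching the definition used in the text.
    return p_g * (fmean(passing) - fmean(scores)) / pstdev(scores)

def item_index_from_sums(scores, item_correct):
    """Equation 30 (or 33): the same index from gross score summations."""
    n = len(scores)
    n_g = sum(item_correct)
    sum_passing = sum(s for s, a in zip(scores, item_correct) if a == 1)
    sum_all = sum(scores)
    sum_squares = sum(s * s for s in scores)
    denominator = n * (n * sum_squares - sum_all ** 2) ** 0.5
    return (n * sum_passing - n_g * sum_all) / denominator

# Hypothetical data: total test scores and 0/1 scores on one item.
test_scores = [12, 15, 9, 20, 17, 11, 14, 18]
item_g = [0, 1, 0, 1, 1, 0, 1, 1]

reliability_index = item_index_from_means(test_scores, item_g)
reliability_index_check = item_index_from_sums(test_scores, item_g)
```

Passing criterion scores instead of test scores to either function would give the validity index in the same way. The sketch assumes that at least one person passes the item and that the scores are not all equal.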

The formulas given here for the point-biserial correlation are a slight 
variant of those presented by Richardson and Stalnaker (1933) and 
Stalnaker (1940). 

Equations 30, 32, 33, and 34 are the basic computing formulas
to be used in calculating the reliability index and the
validity index for a group of items. They are analogous except
for the factor Y or X to the formulas presented by Horst
(1936b).

10. Summary of item selection theory 

The basic theoretical problem for item analysis procedures is to find
a functional relationship between the parameters of the total test and
appropriately selected item parameters. Such a theory must take due
account of important changes in methods of test scoring. It is then
necessary to investigate various factors that produce variation in these
item parameters, such as random sampling error and systematic variation
produced by changes in such factors as the length of the test and
the heterogeneity of the group. Various computational short-cut
procedures utilizing only a portion of the data can also be studied to
determine which method is most economical. In making such comparisons
it is necessary to adjust the sample size so that the statistics compared
will have the same sampling fluctuation.

In the foregoing sections an item analysis rationale has been proposed
for the case in which the test score is the number of items answered
correctly. It has been shown that the test mean, standard deviation,
reliability, and validity may be estimated from three item parameters: a
difficulty, a reliability, and a validity index. The equations are as follows:

(5) $M_x$ or $\bar{X} = \sum_{g=1}^{K} p_g = K\bar{p}$,

(20) $s_x = \sum_{g=1}^{K} r_{xg} s_g = K\,\overline{(r_{xg} s_g)}$,

$r_{xx} = \dfrac{K}{K-1}\left[1 - \dfrac{\sum_{g=1}^{K} s_g^2}{\left(\sum_{g=1}^{K} r_{xg} s_g\right)^2}\right]$,

$r_{xy} = \dfrac{\sum_{g=1}^{K} r_{yg} s_g}{\sum_{g=1}^{K} r_{xg} s_g}$.

In these equations:

$K$ is the number of items in the test,

$N$ is the total number of persons taking the test,

$N_g$ is the number of persons answering item $g$ correctly
($g = 1 \cdots K$),

$p_g$ is the proportion of persons answering item $g$ correctly
(that is, $p_g = N_g/N$),

$s_g^2$ is the variance of item $g$ [$s_g^2 = p_g(1 - p_g)$],

$r_{xg} s_g$, the item reliability index, is the point-biserial item-test
correlation multiplied by the item standard deviation,

$r_{yg} s_g$, the item validity index, is the point-biserial item-criterion
correlation multiplied by the item standard deviation,

$M_x$ or $\bar{X}$ is the mean for all scores in the test distribution,

$s_x$ is the standard deviation of the distribution of test scores,

$r_{xx}$ is the reliability of the test, and

$r_{xy}$ is the test validity, the correlation of test ($x$) with the
criterion ($y$).
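The relations between item parameters and test parameters can be checked numerically. The sketch below is only an illustration: the function names and the small data set are hypothetical, the reliability line assumes the Kuder-Richardson formula 20 form written with $\sum r_{xg} s_g$ in place of $s_x$, and all standard deviations use $N$ in the denominator.

```python
from statistics import fmean, pstdev

def pearson(u, v):
    """Product-moment correlation with N in the denominators."""
    mu, mv = fmean(u), fmean(v)
    cov = fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])
    return cov / (pstdev(u) * pstdev(v))

def estimate_test_parameters(items, criterion):
    """items[i][g] is the 0/1 score of person i on item g."""
    k = len(items[0])
    totals = [sum(row) for row in items]
    difficulty, reliability, validity, variance = [], [], [], []
    for g in range(k):
        column = [row[g] for row in items]
        s_g = pstdev(column)
        difficulty.append(fmean(column))                      # p_g
        variance.append(s_g ** 2)                             # s_g squared
        reliability.append(pearson(column, totals) * s_g)     # r_xg * s_g
        validity.append(pearson(column, criterion) * s_g)     # r_yg * s_g
    mean = sum(difficulty)
    sd = sum(reliability)
    r_xx = k / (k - 1) * (1 - sum(variance) / sd ** 2)
    r_xy = sum(validity) / sd
    return mean, sd, r_xx, r_xy

# Hypothetical data: six persons, four items, one criterion score each.
items = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
criterion = [30, 22, 31, 18, 35, 20]

mean, sd, r_xx, r_xy = estimate_test_parameters(items, criterion)
totals = [sum(row) for row in items]
```

For number-right scores the mean, standard deviation, and validity obtained this way agree with the values computed directly from the total scores, which is a convenient check on hand computation. The sketch fails if any item is passed by everyone or by no one, since its correlation is then undefined.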
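Computing formulas for the item reliability and validity indices were
given as

Equations (30 and 32)

$r_{xg} s_g = \dfrac{N \sum_{i=1}^{N_g} X_{ig} - N_g \sum_{i=1}^{N} X_i}{N \sqrt{N \sum_{i=1}^{N} X_i^2 - \left(\sum_{i=1}^{N} X_i\right)^2}} = p_g \left(\dfrac{\bar{X}_g - \bar{X}}{s_x}\right)$

Equations (33 and 34)

$r_{yg} s_g = \dfrac{N \sum_{i=1}^{N_g} Y_{ig} - N_g \sum_{i=1}^{N} Y_i}{N \sqrt{N \sum_{i=1}^{N} Y_i^2 - \left(\sum_{i=1}^{N} Y_i\right)^2}} = p_g \left(\dfrac{\bar{Y}_g - \bar{Y}}{s_y}\right)$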

In these formulas:

$X$ or $x$ designates the test,

$Y$ or $y$ designates the criterion,

$\bar{Y}$ and $s_y$ are, respectively, the mean and the standard deviation of
the criterion scores,

$X_{ig}$ designates the test score only for those persons who have
answered item $g$ correctly,

$Y_{ig}$ designates the criterion score only for those who have
answered item $g$ correctly, and

$\bar{X}_g$ and $\bar{Y}_g$ are the average test and criterion scores,
respectively, for those answering item $g$ correctly.

The other terms have the same definition as in equations 24 and 25.

One systematic error in the foregoing formulas arises from the fact
that the item reliability index is not invariant with respect to test length.
In most cases the item-test correlation will increase as the test length
increases. The item difficulty and validity indices are not affected by
test length. All three indices are affected by a change in the ability
level of the group. This means that the item parameters must be obtained
on a group similar to that for which the test is being constructed.
The item parameters will be more generally useful if it is possible to
discover parameters that do not vary systematically with changes in
the mean or variance of group ability. If such parameters cannot be
found, it may be possible to make some empirical estimations of the
amount of change that may be expected in the item parameters as a
result of a given change in the group.
In addition to systematic changes of item parameters with group ability
and with test length, the item statistics are subject to random sampling
variation. The magnitude of such fluctuations should be determined
in order that we may estimate the change in test parameters to be
expected when the test is used on a new sample. These sampling errors
could also be used to determine the size for the item analysis sample
that is necessary to give reasonable sampling stability in the test
parameters.

Numerous arbitrary indices of item difficulty and reliability have
been given in the item analysis literature. The attempts to express item
difficulty in terms of an ability level at which the item will be answered
correctly by half the persons are interesting in that one of them may
give a difficulty index that does not vary systematically with changes
in the ability level of the group. Horst's difficulty index, which estimates
the number of persons knowing the answer to the item, may also offer
some interesting possibilities for test construction and scoring.

A large number of arbitrary indices of item reliability or homogeneity
have been reported in the literature. Adkins (1938) has shown that
these indices may be classified as estimates of (1) the item-test
correlation, (2) the regression of item on test, or (3) the regression of test
on item. The regression of item on test should be invariant with respect
to selection on the basis of test score.

Many of the item reliability indices make use of only a portion of the 
data and estimate a correlation or a slope from widespread classes. As 
far as the author is aware, the efficiency of these methods has not been 
compared with methods using the entire sample, when sample size is 
adjusted so as to secure equal sampling errors. 

11. Prospective developments in item selection techniques 

In considering the subsequent development of item analysis procedures,
there are a number of problems to which special attention should
be called. For the special case of tests for which the score is the number
of correct answers we have several unsolved problems. What are the
appropriate item selection procedures for controlling the skewness or
the kurtosis of the distribution of total test scores? The development
of such procedures will probably present more difficulties than the
problem of maximizing reliability or validity, since we should usually be
interested in arriving at some intermediate point, such as zero skew or
normal kurtosis. This would require much more accurate estimation
than obtaining the highest reliability or validity possible with a given
set of items.

A basic assumption in developing the theory of the influence of group
heterogeneity (see Chapters 10 and 11) is that the error of measurement
does not vary systematically with test score. It is likely that this is
true for some types of item selection and not for others. How can
items be selected to keep the error of measurement constant at different
points on the scale? How can items be selected to make the error of
measurement smallest at a prospective cutting score? Since the variation
of the error of measurement with test score depends on the third
and fourth moments (Mollenkopf, 1948, 1949), the theoretical analysis
of the item selection procedures offers some difficulties that have not
yet been surmounted. However, since the error of measurement is a
fundamental statistic for a test, it will be a distinct advance when item
selection techniques can selectively control the error of measurement
for different test scores.

In this chapter the theoretical analysis of item analysis procedures
has been presented only for the special case of the number right score.
A corresponding analysis of the relationship between the item parameters
and the test parameters is needed for other types of test scoring
procedures. For example, a different type of item analysis is appropriate
if the score is on the basis of level reached as in the absolute scaling
methods (Thurstone, 1925 or 1927b), or as in the scaling methods
developed by Guttman or the latent structure methods of Lazarsfeld;
see Social Science Research Council (1950), Vol. IV.

As pointed out repeatedly in Psychometric Methods (Guilford, 1936b),
the persons developing test theory have been for the most part
unacquainted with, or have ignored, the work in psychophysics. There
has been some attempt to develop a theory relating the two fields
(Mosier, 1940 and 1941), but the psychophysical techniques, as
developed by Thurstone (1925, 1927b), have not been systematically
applied in the large-scale practical work in aptitude or achievement
testing. Some exploratory work in this field has been done by
Grossnickle (1942) and Lorr (1944). The integration of psychophysical
theory and test theory would be a major achievement.

Another set of important item analysis problems deals with the nature 
of the changes in item parameters with changes in the test group. A 
significant contribution to item analysis theory would be the discovery 
of item parameters that remained relatively stable as the item analysis 
group changed; or the discovery of a law relating the changes in item 
parameters to changes in the group. 

Relatively little experimental or theoretical work has been done on
the effect of group changes on item parameters. If we assume that a
given item requires a certain ability (A), the proportion of a group
answering that item correctly will increase and decrease as the ability
level of the group changes. The amount of this change will be greater
for an item that is highly correlated with ability A than for one that
correlates only moderately with ability A. If we have some standard
measure of ability A, it may be that the ability level at which 50 per
cent pass and 50 per cent fail would not be subject to as much fluctuation
as the proportion of correct responses. As yet there has been no
systematic theoretical treatment of measures of item difficulty directed
particularly toward determining the nature of their variation with
respect to changes in group ability. Neither has the experimental work
on item analysis been directed toward determining the relative invariance
of item parameters with systematic changes in the ability level of the
group tested.

A similar problem of invariance is encountered in considering measures
of the relationship between an item and the total test score or the
criterion score. For example, the reliability index presented in this
chapter involves the point-biserial correlation. This coefficient varies
systematically with item difficulty (Carroll, 1945; Ferguson, 1941a;
Gulliksen, 1945) and consequently will vary with the ability level
of the group tested. Theoretically there is no such systematic bias in
biserial correlation. The biserial correlation should not change as the
item difficulty changes with variations in group ability level. However,
the data given by Richardson (1936a) showed systematic changes in
biserial correlation with changes in ability level of the group. It might
be found that some statistic related to the error of measurement or the
slope of the regression line would turn out to be relatively stable despite
changes in the mean and the standard deviation of ability in the group
tested. If such a statistic were developed and used, then in constructing
any test it would be necessary to have information on the ability range
to be tested in order to construct a suitable test from the items available.
As is true for item-difficulty parameters, we do not have the appropriate
theoretical and experimental investigations showing how different
item-test correlation measures vary with changes in the average and standard
deviation of ability of the group tested.

The discussion in the foregoing paragraph applies both to item-test
and item-criterion correlations. There is one additional factor affecting
item-test correlations that does not influence item-criterion correlations.
The length of the test of which the item is a part will affect the item-test
correlation but cannot influence the item-criterion correlation. For very
short tests (two or three items), the item score will form a considerable
fraction of the test score; hence the item-test correlation will at first
tend to decrease as items are added to the test. For tests larger than
fifty or a hundred items, this effect is negligible; and, as the test length
increases, a slight increase in item-test correlation could be expected
because of the decrease in the error component of the total test score as
test length is increased. Again the appropriate theoretical and
experimental investigations are lacking. It is probable, however, that some
conditions regarding a minimum number of items for a subset could be
found so that we might say that neither of these factors is serious as
long as we consider subsets of no fewer than, for instance, fifty items.

In addition to the problems of the relationship between item parameters
and test parameters, and the nature of the variation of item
parameters with changes in other factors such as the length of the test
and the ability of the group, we have the problem of the most efficient
statistics to use in estimating these parameters. A complete treatment
of this problem would include both statistical efficiency in the sense of
reducing the sampling error of the statistic, and cost efficiency in the
sense of reducing the labor and machine costs of computation. In
comparing different methods for an over-all determination of efficiency, it
is necessary to adjust the number of cases for each method so as to
equalize the sampling error, and then compare the costs of dealing with
these appropriately adjusted numbers of cases.


Problems 

1. Assume that published data give the biserial correlation between each item and
the total test or the criterion score. Give the formula for changing biserial correlation
into the reliability or the validity index discussed here.

2. Show the relationship between the method of improving test validity presented
by Horst (1936b), “Item Selection by Means of a Maximizing Function”
(Psychometrika, —) and the method presented here.

3. Study the material in Guilford, Psychometric Methods, pages 434 and 435, on
Cook’s index B and Clark’s index. Compare these two indices.

4. The following item analysis information is available on a 35-item test. Which
items should be eliminated in shortening the test to 30 items?





Item Analysis Information for Use with Problem 4

(These data were furnished through the courtesy of Dr. W. G. Mollenkopf
of the Educational Testing Service.)

The columns are: item number; proportion answering item correctly, $p_g$;
standard deviation of item, $s_g = \sqrt{p_g - p_g^2}$; point-biserial
correlation of item with total test score, $r_{xg}$, and with criterion
score, $r_{yg}$; reliability index, $r_{xg} s_g$; and validity index,
$r_{yg} s_g$. An entry shown as -- is not legible in the source.

Item     p_g     s_g     r_xg    r_yg    r_xg s_g   r_yg s_g
  1      --      .400    .280    .203    .112       --
  2      .814    .389    .393    .152    .153       --
  3      .731    .443    .142    .126    .063       --
  4      .807    .395    .256    .327    .101       .129
  5      .241    .428    .409    .168    .175       .072
  6      .379    .485    .266    .188    .129       .091
  7      .717    .451    .233    .186    .105       .084
  8      .772    .420    .200    .200    .084       .084
  9      .634    .482    .203    .212    .098       .102
 10      .559    .497    .237    .213    .118       .106
 11      .641    .480    .375    .273    .180       .131
 12      .621    .485    .400    .291    .194       .141
 13      .241    .428    .285    .313    .122       .134
 14      .441    .497    .270    .245    .134       .122
 15      .324    .468    .385    .239    .180       .112
 16      .628    .483    .290    .157    .140       --
 17      .138    .345    .287    .165    .099       --
 18      .483    .500    .414    .232    .207       .116
 19      --      .296    .253    .267    .075       --
 20      --      .498    .301    .309    .150       .154
 21      --      .474    .441    .255    .209       .121
 22      --      .449    .434    .165    .195       .074
 23      .667    .471    .193    .208    .091       --
 24      .457    .498    .327    .207    .163       --
 25      .435    .496    .278    .222    .138       --
 26      .378    .485    .227    .196    .110       .095
 27      .149    .356    .213    .247    .076       .088
 28      .305    .460    .143    .191    .066       .088
 29      .320    .467    .435    .330    .203       .154
 30      .545    .498    .143    .118    .071       .059
 31      .560    .496    .381   -.008    .189      -.004
 32      .356    .479    .476    .194    --         .093
 33      .713    .452    .257    .150    .116       .068
 34      .558    .497    .366    .215    .182       .107
 35      .308    .462    .266    .182    .123       .084
APPENDIX A


Equations from Algebra, Analytical Geometry,
and Statistics, Used in Test Theory

Elementary algebra assumed for test theory

Expansion of binomials:

$(x + y)^2 = x^2 + 2xy + y^2$

$(x - y)^2 = x^2 - 2xy + y^2$

$(x + y)^n = x^n + nx^{n-1}y + \dfrac{n(n-1)}{2}x^{n-2}y^2 + \cdots + \dfrac{n!}{r!(n-r)!}x^{n-r}y^r + \cdots + nxy^{n-1} + y^n$

Factorial notation:

$n! = n(n-1)(n-2)\cdots(3)(2)(1)$

Expansion of polynomials:

$(a + b + \cdots + y + z)^2 = a^2 + b^2 + \cdots + y^2 + z^2$
$\quad + 2ab + \cdots + 2ay + 2az + \cdots$
$\quad + 2by + 2bz + \cdots + 2yz$

$(a + b + \cdots + y + z)(A + B + \cdots + Y + Z)$
$\quad = aA + aB + \cdots + aY + aZ + bA + bB + \cdots + bY + bZ + \cdots$
$\quad + yA + yB + \cdots + yY + yZ + zA + zB + \cdots + zY + zZ$
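Solution of simultaneous linear equations:

$ax + by = c$
$dx + ey = f$

$x = \dfrac{\begin{vmatrix} c & b \\ f & e \end{vmatrix}}{\begin{vmatrix} a & b \\ d & e \end{vmatrix}} = \dfrac{ce - bf}{ae - bd}$

$y = \dfrac{\begin{vmatrix} a & c \\ d & f \end{vmatrix}}{\begin{vmatrix} a & b \\ d & e \end{vmatrix}} = \dfrac{af - cd}{ae - bd}$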
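The determinant solution translates directly into a few lines of code. The following is a minimal sketch; the function name `solve_2x2` is hypothetical, and the argument names follow the coefficients of the two equations above.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve ax + by = c and dx + ey = f by determinants."""
    determinant = a * e - b * d
    if determinant == 0:
        # The two lines are parallel or identical: no unique solution.
        raise ValueError("the equations have no unique solution")
    x = (c * e - b * f) / determinant
    y = (a * f - c * d) / determinant
    return x, y

# Example: 2x + y = 5 and x + 3y = 10.
x, y = solve_2x2(2, 1, 5, 1, 3, 10)
```

Substituting the returned values back into both equations is a quick way to confirm the result.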

The solution of a quadratic equation:

$ax^2 + bx + c = 0$

$x = \dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

Analytical geometry assumed for test theory

Equation of the straight line:

$y = ax + b$

where $a$ is the slope of the line, and
$b$ is the intercept on the y-axis.

Equation of a circle with its center at the origin and radius $r$:

$x^2 + y^2 = r^2$

General equation of a circle:

$(x - a)^2 + (y - b)^2 = r^2$

where $r$ is the radius,
$a$ is the abscissa value for the center, and
$b$ is the ordinate for the center.

Equation of a hyperbola with asymptotes $x = 0$ and $y = 0$:

$xy = c$
Equation of a hyperbola with asymptotes $x = a$ and $y = b$:

$(x - a)(y - b) = c$

Equation of a hyperbola with asymptotes $x\sqrt{a} \pm y\sqrt{b} = 0$:

$ax^2 - by^2 = c$

Equation of an ellipse:

$ax^2 + by^2 = c$

Equation of a parabola:

$y = ax^2 + bx + c$

The treatment of conics may be generalized as follows:

$A'x'^2 + B'x'y' + C'y'^2 + D'x' + E'y' + F' = 0$
(General equation of the second degree; represents any conic section.)

$x' = x\cos\phi - y\sin\phi$
$y' = x\sin\phi + y\cos\phi$
(Transformation used to change the general equation of the second
degree to a standard form.)

where $\tan 2\phi = \dfrac{B'}{A' - C'}$

sin = 


cos $ = 

By use of the foregoing transformation any equation of the second
degree can be rotated to the following standard form, such that the
coefficient of the $xy$ term is zero.

$Ax^2 + Cy^2 + Dx + Ey + F = 0$
(Standard form for the general second-degree equation. Represents
any conic section [with axes parallel to the coordinate axes].)

If A and C have unlike signs, this equation represents a hyperbola
with axes parallel to the coordinate axes.

If A and C have the same sign, this equation represents an ellipse
with axes parallel to the coordinate axes.

If A equals C, this equation represents a circle.

If A or C equals zero, this equation represents a parabola with its
axis parallel to a coordinate axis.

If A equals C equals zero, this equation represents a straight line.
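The five rules above can be applied mechanically to the coefficients $A$ and $C$ of the standard form. The sketch below is illustrative only: the function name is hypothetical, the rules are checked from the most special case to the most general, and degenerate cases (for example an equation with no real points) are not detected.

```python
def classify_conic(A, C):
    """Classify Ax^2 + Cy^2 + Dx + Ey + F = 0 from A and C alone."""
    if A == 0 and C == 0:
        return "straight line"
    if A == 0 or C == 0:
        return "parabola"
    if A == C:
        return "circle"
    if (A > 0) == (C > 0):
        return "ellipse"      # same sign, unequal coefficients
    return "hyperbola"        # unlike signs

kinds = [classify_conic(A, C) for A, C in [(1, 1), (2, 3), (1, -1), (0, 1), (0, 0)]]
```

The circle test comes before the ellipse test because a circle is the special case of equal coefficients with the same sign.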

Elementary statistics assumed in test theory

$X_i$ = gross score of individual $i$
$N$ = number of persons in sample

$\bar{X}$ or $M_x = \dfrac{\sum X_i}{N}$ (mean)

$x_i = X_i - M_x$ (deviation score)

$\bar{x} = 0$ (mean deviation score)

$s_x^2 = \dfrac{\sum x_i^2}{N} = \dfrac{\sum X_i^2}{N} - M_x^2$ (variance of a specified sample)

$\dfrac{\sum x_i^2}{N - 1}$ (estimate of variance in universe from which
sample is drawn)

See Yule and Kendall (1940), pages 434-436, for a discussion of the
use of $N$ and $N - 1$ in the denominator of the formula for variance.

$\sum x_i^2 = \sum X_i^2 - N M_x^2$ (gross score formula for sum of
squares of deviations from the mean)

$r_{xy} = \dfrac{\sum xy}{N s_x s_y} = \dfrac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$ (deviation score formula for correlation)

$r_{xy} = \dfrac{\sum XY - N M_x M_y}{\sqrt{\sum X^2 - N M_x^2}\,\sqrt{\sum Y^2 - N M_y^2}}$ (gross score formula for correlation)

$\dfrac{\sum xy}{N} = \dfrac{\sum XY - N M_x M_y}{N}$ (covariance)
$y = r_{xy}\dfrac{s_y}{s_x}x$ (deviation score formula for regression of
$y$ on $x$, used to estimate $y$ from $x$)

$Y = r_{xy}\dfrac{s_y}{s_x}X + M_y - r_{xy}\dfrac{s_y}{s_x}M_x$ (gross score formula for
regression of $Y$ on $X$)

$x = r_{xy}\dfrac{s_x}{s_y}y$ (deviation score formula for regression of
$x$ on $y$, used to estimate $x$ from $y$)

$X = r_{xy}\dfrac{s_x}{s_y}Y + M_x - r_{xy}\dfrac{s_x}{s_y}M_y$ (gross score formula for
regression of $X$ on $Y$)

$s_{y.x} = s_y\sqrt{1 - r_{xy}^2}$ (error of estimate, error made
in estimating $y$ from $x$)

$s_{x.y} = s_x\sqrt{1 - r_{xy}^2}$ (error of estimate, error made
in estimating $x$ from $y$)

$r_{xy.z} = \dfrac{r_{xy} - r_{xz}r_{yz}}{\sqrt{1 - r_{xz}^2}\,\sqrt{1 - r_{yz}^2}}$ (partial correlation, the
correlation between $x$ and $y$ for a constant value of $z$)

$s_{x-y} = \sqrt{s_x^2 + s_y^2 - 2r_{xy}s_xs_y}$ (standard deviation of
a difference)

$s_{x+y} = \sqrt{s_x^2 + s_y^2 + 2r_{xy}s_xs_y}$ (standard deviation of
a sum)

$s_{x \pm y} = \sqrt{s_x^2 + s_y^2}$ (standard deviation of a sum or
difference for the special case of zero correlation)

$A_{ig} = 1$ or $0$ (score of individual $i$ on item $g$)

$p_g = \dfrac{\sum_{i=1}^{N} A_{ig}}{N}$ (proportion of individuals
answering item $g$ correctly)

$q_g = 1 - p_g$ (proportion of individuals answering
item $g$ incorrectly)

$s_g^2 = p_g q_g = p_g - p_g^2$ (variance of item $g$)
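$\bar{X}_{g+} = \dfrac{\sum_{i=1}^{N} X_i A_{ig}}{N p_g}$ (average test score for persons
answering item $g$ correctly)

$\bar{X}_{g-} = \dfrac{\sum_{i=1}^{N} X_i (1 - A_{ig})}{N q_g}$ (average test score for persons
answering item $g$ incorrectly)

$z_g$ = normal curve ordinate corresponding to the area indicated by
$p_g$ or $q_g$

$Nz = \dfrac{N}{\sigma_x\sqrt{2\pi}}\,e^{-x^2/2\sigma_x^2}$ (ordinate of normal curve)

where $N$ = number of cases,
$\sigma_x$ = standard deviation of distribution,
$\pi$ = 3.1416, and
$e = 2.7183 = \lim_{n \to \infty}\left(1 + \dfrac{1}{n}\right)^n$

${}_{bis}r_{xg} = \dfrac{\bar{X}_{g+} - \bar{X}_{g-}}{s_x}\cdot\dfrac{p_g q_g}{z_g} = \dfrac{\bar{X}_{g+} - \bar{X}}{s_x}\cdot\dfrac{p_g}{z_g}$
(equivalent formulas for biserial correlation of item $g$ with score $x$)

${}_{pt\,bis}r_{xg} = \dfrac{\bar{X}_{g+} - \bar{X}_{g-}}{s_x}\sqrt{p_g q_g} = \dfrac{\bar{X}_{g+} - \bar{X}}{s_x}\sqrt{\dfrac{p_g}{q_g}}$
(equivalent formulas for point-biserial correlation of item $g$
with score $x$)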

Use of the summation sign

If $k$ represents a constant, and $x$, $y$, $z$, and $w$ represent variables, the
major principles in the use of the summation sign may be indicated in
the following equations. Since all summations are assumed to be over
persons, the subscripts and limits are not given.

$\sum(x + y) = \sum x + \sum y$ (The sum of $x + y$ is equal to summation $x$
plus summation $y$.)

$\sum(x - y) = \sum x - \sum y$ (The sum of a set of differences is equal to the
difference of the two sums.)

$\sum kx = k\sum x$ (The sum of a constant $k$ times a variable $x$ is equal to
the constant multiplied by summation $x$.)

$\sum k = Nk$ (The sum of a constant term is equal to $N$ times the
constant term.)

Combinations of these principles with elementary algebra are illustrated
in the following equations.

$\sum k(x + y) = k\sum x + k\sum y$

$\sum(kx + y)(z + w) = k\sum xz + \sum yz + \sum yw + k\sum xw$
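These rules are easy to confirm on a small set of numbers. The values below are hypothetical integers chosen so that both sides can be compared exactly.

```python
k = 3                       # a constant
x = [1, 4, 2, 5]            # four persons' values on each variable
y = [2, 0, 3, 1]
z = [4, 1, 1, 2]
w = [0, 2, 5, 3]
n = len(x)

# Sum of a constant term: N times the constant.
sum_of_constant = sum(k for _ in range(n))

# Left and right sides of the last identity above.
left = sum((k * xi + yi) * (zi + wi) for xi, yi, zi, wi in zip(x, y, z, w))
right = (k * sum(xi * zi for xi, zi in zip(x, z))
         + sum(yi * zi for yi, zi in zip(y, z))
         + sum(yi * wi for yi, wi in zip(y, w))
         + k * sum(xi * wi for xi, wi in zip(x, w)))
```

The product inside the summation is expanded term by term first, and only then is the summation sign distributed over the four resulting terms.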

The score matrix

$X_{11}$   $X_{12}$   $X_{13}$   $\cdots$   $X_{1K}$
$X_{21}$   $X_{22}$   $X_{23}$   $\cdots$   $X_{2K}$
$X_{31}$   $X_{32}$   $X_{33}$   $\cdots$   $X_{3K}$
$\vdots$
$X_{N1}$   $X_{N2}$   $X_{N3}$   $\cdots$   $X_{NK}$

The foregoing matrix represents the scores of $N$ persons on each of
$K$ tests. The first subscript designates the persons (from 1 to $N$); and
the second subscript designates the tests (from 1 to $K$). The scores
in any given column are the scores of all the persons on one test, and the
scores in any row are the scores of one person on all the tests.

The general term in this matrix may be written

$X_{ig}$ ($i = 1 \cdots N$; $g = 1 \cdots K$)

$X_{ig}$ indicates the score of the $i$th person on the $g$th test. The notation
in parentheses shows that $i$ varies from 1 to $N$ and $g$ from 1 to $K$.

The mean of any particular test ($g$) is written as follows in the double
subscript notation:

$M_{.g} = \dfrac{\sum_{i=1}^{N} X_{ig}}{N}$

The period is used to indicate the position of the subscript over which
we have summed. This is read: The mean of the $g$th test is equal to
the summation of X sub i g for test $g$ from $i$ equals 1 to $i$ equals $N$,
divided by the number of persons.
Using this same double subscript notation, it is possible to express
the average score on all the tests for any one person. This is written

$M_{i.} = \dfrac{\sum_{g=1}^{K} X_{ig}}{K}$

which is read: The average score of the $i$th person equals the sum of
the scores ($X_{ig}$) for that person, from $g$ equals 1 to $g$ equals $K$, divided
by $K$ (the number of tests). The period is used in place of subscript $g$
since summation was over this subscript.

The average score of all persons on all the tests would be expressed
with a double summation notation,

$M_{..} = \dfrac{\sum_{i=1}^{N}\sum_{g=1}^{K} X_{ig}}{NK}$

This is read: $M_{..}$ is defined as the summation of $X_{ig}$ with respect to $g$
from $g$ equals 1 to $g$ equals $K$, summed with respect to $i$ from $i$ equals
1 to $i$ equals $N$, divided by $N$ times $K$.

Wherever we are dealing with several persons and tests, the double
subscript notation is desirable to avoid ambiguity. If no ambiguity
arises, it is permissible to omit the subscripts after $X$, and also to omit
the limits above and below the summation sign.
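The double subscript notation corresponds closely to indexing a list of rows. The sketch below uses a hypothetical score matrix of four persons and three tests; note that Python counts from 0, so `scores[1][2]` holds $X_{23}$.

```python
# Hypothetical score matrix: scores[i][g] is the score of person i on test g.
scores = [
    [3, 5, 2],
    [4, 4, 6],
    [1, 2, 2],
    [5, 7, 4],
]
n = len(scores)        # N persons (rows)
k = len(scores[0])     # K tests (columns)

# M_.g : mean of each test, summing over persons.
test_means = [sum(scores[i][g] for i in range(n)) / n for g in range(k)]

# M_i. : mean of each person, summing over tests.
person_means = [sum(scores[i][g] for g in range(k)) / k for i in range(n)]

# M_.. : mean over all persons and all tests.
grand_mean = sum(scores[i][g] for i in range(n) for g in range(k)) / (n * k)
```

The grand mean equals both the mean of the test means and the mean of the person means, since every row has the same number of entries.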

It should be noted that the matrix of scores is not symmetric. The
score of the second person on the third test is different from the score
of the third person on the second test.

($X_{23} \neq X_{32}$)

On the other hand, the variance-covariance matrix or the intercorrelation
matrix is symmetric. The correlation (or covariance) of test 2 with
test 3 is identical with that of test 3 with test 2.


The correlation matrix

$s_1^2$          $r_{12}s_1s_2$   $r_{13}s_1s_3$   $\cdots$   $r_{1K}s_1s_K$
$r_{12}s_1s_2$   $s_2^2$          $r_{23}s_2s_3$   $\cdots$   $r_{2K}s_2s_K$
$r_{13}s_1s_3$   $r_{23}s_2s_3$   $s_3^2$          $\cdots$   $r_{3K}s_3s_K$
$\vdots$
$r_{1K}s_1s_K$   $r_{2K}s_2s_K$   $r_{3K}s_3s_K$   $\cdots$   $s_K^2$

The foregoing matrix shows the variances and covariances for a set of
$K$ tests or items. The variance of the sum of the tests (or items) is the
sum of the terms in this variance-covariance matrix. This sum may be
written in several different ways,

$\sum_{g=1}^{K}\sum_{h=1}^{K} r_{gh}s_gs_h \quad (r_{gg} = r_{hh} = 1)$

In order to show explicitly the difference between the variances and
covariances, we may write

$\sum_{g=1}^{K} s_g^2 + \sum_{g=1}^{K}\sum_{h=1}^{K} r_{gh}s_gs_h \quad (g \neq h)$

Sometimes the second term is written without the two summation
signs and the upper limit used to designate the number of terms as
follows:

$\sum_{g=1}^{K} s_g^2 + \sum_{g \neq h = 1}^{K^2 - K} r_{gh}s_gs_h$

With this notation it is understood that, since the terms where $g = h$
are omitted, there are $K^2 - K$ terms in the second summation. Since
the terms above the principal diagonal are identical with those below
it, this sum for a symmetric matrix is sometimes written

$\sum_{g=1}^{K} s_g^2 + 2\sum_{g > h} r_{gh}s_gs_h$
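The statement that the variance of a sum equals the sum of all terms in the variance-covariance matrix can be verified on a small example. The sketch below uses hypothetical scores and a hypothetical helper name, with $N$ in the denominator of every variance and covariance.

```python
from statistics import fmean, pvariance

def covariance_matrix(scores):
    """scores[i][g] is the score of person i on test g; entries are r_gh s_g s_h."""
    n, k = len(scores), len(scores[0])
    means = [fmean([row[g] for row in scores]) for g in range(k)]
    return [[sum((row[g] - means[g]) * (row[h] - means[h]) for row in scores) / n
             for h in range(k)]
            for g in range(k)]

# Hypothetical scores of four persons on three tests.
scores = [[3, 5, 2], [4, 4, 6], [1, 2, 2], [5, 7, 4]]
cov = covariance_matrix(scores)
k = len(cov)

sum_of_all_terms = sum(sum(row) for row in cov)
sum_of_variances = sum(cov[g][g] for g in range(k))
sum_below_diagonal = sum(cov[g][h] for g in range(k) for h in range(g))  # g > h

# Variance of the composite score, computed directly from the row totals.
variance_of_sum = pvariance([sum(row) for row in scores])
```

Both ways of writing the sum, all $K^2$ terms together or the variances plus twice the terms below the diagonal, reproduce the variance of the total score.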
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Table of Ordinates and Areas 
of the Normal Curve 
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TABLE OF THE NORMAL CURVE 

Ordinates (*) and Cumulative Area ( A ) of the Right Half of the Normal 
Curve of Distribution of Unit Area 
For cumulative of whole curve, read .5 ± A for =t x/<r. Ordinates are represented 
in terms of the total area as unity. 


x/<r 

* 

A 

x/o’ 

z 

A 

0.00 

0.39894 

0.00000 

0.50 

0.35207 

0.19146 

0.01 

0.39892 

0.00399 

0.51 

0.35029 

0.19497 

0.02 

0.39886 

0.00798 

0.52 

0.34849 

0.19847 

MI 

0.39876 

0.01197 

0.53 

0.34667 

0.20194 

■ it 

0.39862 

0.01595 

0.54 

0.34482 

0.20540 

HwM 

0.39844 

0.01994 

0.55 

0.34294 

0.20884 


0.39822 

0.02392 

0.56 

0.34105 

0.21226 

■XttXti 

0.39797 

0.02790 

0.57 

0.33912 

0.21566 

0.08 

0.39767 

0.03188 

0.58 

0.33718 

0.21904 

0.09 

0.39733 

0.03586 

0.59 

0.33521 

0.22240 

0.10 

0.39695 

0.03983 

0.60 

0.33322 

0.22575 

0.11 

0.39654 

0.04380 

0.61 

0.33121 . 

0.22907 

0.12 

0.39608 

0.04776 

0.62 

0.32918 

0.23237 

0.13 

0.39559 

0.05172 

0.63 

0.32713 

0.23565 

0.14 

0.39505 

0.05567 

0.64 

0.32506 

0.23891 

0.15 

0.39448 

0.05962 

0.65 

0.32297 

0.24215 

0.16 

0.39387 

0.06356 

0.66 

0.32086 

0.24537 

0.17 

0.39322 

0.06749 

0.67 

0.31874 

0.24857 

0.18 

0.39253 

0.07142 

0.68 

0.31659 

0.25175 

0.19 

0.39181 

0.07535 

0.69 

0.31443 

0.25490 

0.20 

0.39104 

0.07926 


0.31225 

0.25804 

0.21 

0.39024 

0.08317 

0.71 

0.31006 

0.26115 

0.22 

0.38940 

0.08706 

0.72 

0.30785 

0.26424 

0.23 

0.38853 

0.09095 

0.73 

0.30563 

0.26730 

0.24 

0.38762 

0.09483 

0.74 

0.30339 

0.27035 

0.25 

0.38667 

0.09871 

0.75 

0.30114 

0.27337 

0.26 

0.38568 

0.10257 

0.76 

0.29887 

0.27637 

0.27 

0.38466 

0.10642 

0.77 

0.29659 

0.27935 

0.28 

! 0.38361 

0.11026 

0.78 

0.29431 

0.28230 

0.29 

0.38251 

0.11409 

0.79 

0.29200 

0.28524 

0.30 

0.38139 

0.11791 


0.28969 

0.28814 

0.31 

0.38023 

0.12172 

0.81 

0.28737 

0.29103 

0.32 

0.37903 

0.12552 

0.82 

0.28504 

0.29389 

0.33 

0.37780 

0.12930 

0.83 

0.28269 

0.29673 

0.34 

0.37654 

0.13307 

0.84 

0.28034 

0.29955 

0.35 

0.37524 . 

0.13683 

0.85 

0.27798 

0.30234 

0.36 

0.37391 

0.14058 

0.86 

0.27562 

0.30511 

0.37 

0.37255 

0.14431 

0.87 

0.27324 

0.30785 

0.38 

0.37115 

0.14803 

0.88 

0.27086 

0.31057 

0.39 

0.36973 

0.15173 

0.89 

0.26848 

0.31327 

0.40 

0.36827 

0.15542 

0.90 

0.26609 

0.31594 

0.41 

0.36678 

0.15910 

0.91 

0.26369 

0.31859 

0.42 

0.36526 

0.16276 

092 

0.26129 

0.32121 

0.43 

0.36371 

0.16640 

0.93 

0.25888 

0.32381 

0.44 

0.36213 

0.17003 

0.94 

0.25647 

0.32639 

0.45 

0.36053 

0.17364 

0.95 

0.25406 

0.32894 

0.46 

0.35889 

0.17724 

0.96 

0.25164 

0.33147 

0.47 

0.35723 

0.18082 

0.97 

0.24923 

0.33398 

0.48 

0.35553 

0.18439 

0.98 

0.24681 

0 33646 

0.49 

0.35381 

0.18793 

0.99 

0.24439 

0.33891 


Reprinted by permission from Business Statistics, by George R. Davies and Dale
Yoder, Second Edition, pages 582-585. New York: John Wiley and Sons, Inc.
TABLE OF THE NORMAL CURVE (Continued)


x/σ

z

A

x/σ

z


1.00 

0.24197 

0.34134 

1.50 

0.12952 

0. 

1.01 

0.23955 

0.34375 

1.51 

0.12758 

0. 

1.02 

0.23713 

0.34614 

1.52 

0.12566 

0. 

1.03 

0.23471 

0.34850 

1.53 

0.12376 

0. 

1.04 

0.23230 

0.35083 

1.54 

0.12188 

0. 

1.05 

0.22988 

0.35314 

1.55 

0.12001 

0. 

1.06 

0.22747 

0.35543 

1.56 

0.11816 

0. 

1.07 

0.22506 

0.35769 

1.57 

0.11632 

0. 

1.08 

0.22265 

0.35993 

1.58 

0.11450 

0. 

1.09 

0.22025 

0.36214 

1.59 

0.11270 

0. 

1.10 

0.21785 

0.36433 

1.60 

0.11092 

0. 

1.11 

0.21546 

0.36650 

1.61 

0.10915 

0. 

1.12 

0.21307 

0.36864 

1.62 

0.10741 

0.

1.13 

0.21069 

0.37076 

1.63 

0.10567 

0. 

1.14 

0.20831 

0.37286 

1.64 

0.10396

0. 

1.15 

0.20594 

0.37493 

1.65 

0.10226 

0. 

1.16 

0.20357 

0.37698 

1.66 

0.10059 

0. 

1.17 

0.20121 

0.37900 

1.67 

0.09893 

0. 

1.18 

0.19886 

0.38100 

1.68 

0.09728 

0. 

1.19 

0.19652 

0.38298 

1.69 

0.09566 

0. 

1.20 

0.19419 

0.38493 

1.70 

0.09405 

0. 

1.21 

0.19186 

0.38686 

1.71 

0.09246 

0. 

1.22 

0.18954 

0.38877 

1.72 

0.09089 

0. 

1.23 

0.18724 

0.39065 

1.73 

0.08933

0. 

1.24 

0.18494 

0.39251 

1.74 

0.08780 

0. 

1.25 

0.18265 

0.39435 

1.75 

0.08628 

0. 

1.26 

0.18037 

0.39617 

1.76 

0.08478 

0. 

1.27 

0.17810 

0.39796 

1.77 

0.08329 

0. 

1.28 

0.17585 

0.39973 

1.78 

0.08183 

0. 

1.29 

0.17360 

0.40147 

1.79 

0.08038 

0. 

1.30 

0.17137 

0.40320 

1.80 

0.07895 

0. 

1.31 

0.16915 

0.40490 

1.81 

0.07754 

0. 

1.32 

0.16694 

0.40658 

1.82 

0.07614 

0. 

1.33 

0.16474 

0.40824 

1.83 

0.07477 

0. 

1.34 

0.16256 

0.40988 

1.84 

0.07341 

0. 

1.35 

0.16038 

0.41149 

1.85 

0.07206 

0. 

1.36 

0.15822 

0.41309 

1.86 

0.07074 

0. 

1.37 

0.15608 

0.41466 

1.87 

0.06943 

0. 

1.38 

0.15395 

0.41621 

1.88 

0.06814 

0. 

1.39 

0.15183 

0.41774 

1.89 

0.06687 

0. 

1.40 

0.14973 

0.41924 

1.90 

0.06562 

0. 

1.41 

0.14764 

0.42073 

1.91 

0.06438 

0. 

1.42 

0.14556 

0.42220 

1.92 

0.06316 

0.

1.43

0.14350 

0.42364 

1.93 

0.06195 

0. 

1.44 

0.14146 

0.42507 

1.94 

0.06077

0. 

1.45 

0.13943 

0.42647 

1.95 

0.05959 

0.

1.46 

0.13742 

0.42786 

1.96 

0.05844 

0.

1.47 

0.13542 

0.42922 

1.97 

0.05730 

0.

1.48 

0.13344 

0.43056 

1.98 

0.05618 

0.

1.49 

0.13147 

0.43189 

1.99 

0.05508 

0.
TABLE OF THE NORMAL CURVE (Continued)


z

A

x/σ

z

A

0.05399 

0.47725 

2.50 

0.01753 

0.49379 

0.05292 

0.47778 

2.51 

0.01709 

0.49396 

0.05186 

0.47831 

2.52 

0.01667 

0.49413 

0.05082 

0.47882 

2.53 

0.01625 

0.49430 

0.04980 

0.47932 

2.54 

0.01585 

0.49446 

0.04879 

0.47982 

2.55 

0.01545 

0.49461 

0.04780 

0.48030 

2.56 

0.01506 

0.49477 

0.04682 

0.48077 

2.57 

0.01468 

0.49492 

0.04586 

0.48124 

2.58 

0.01431 

0.49506 

0.04491 

0.48169 

2.59 

0.01394 

0.49520 

0.04398 

0.48214 

2.60 

0.01358 

0.49534 

0.04307 

0.48257 

2.61 

0.01323 

0.49547 

0.04217 

0.48300 

2.62 

0.01289 

0.49560 

0.04128 

0.48341 

2.63 

0.01256 

0.49573 

0.04041 

0.48382 

2.64 

0.01223 

0.49585 

0.03955 

0.48422 

2.65 

0.01191 

0.49598 

0.03871 

0.48461 

2.66 

0.01160 

0.49609 

0.03788 

0.48500 

2.67 

0.01130 

0.49621 

0.03706 

0.48537 

2.68 

0.01100 

0.49632 

0.03626 

0.48574 

2.69 

0.01071 

0.49643 

0.03547 

0.48610 

2.70 

0.01042 

0.49653 

0.03470 

0.48645 

2.71 

0.01014 

0.49664 

0.03394 

0.48679 

2.72 

0.00987 

0.49674 

0.03319 

0.48713

2.73 

0.00961 

0.49683 

0.03246 

0.48745 

2.74 

0.00935 

0.49693 

0.03174 

0.48778 

2.75 

0.00909 

0.49702 

0.03103 

0.48809 

2.76 

0.00885 

0.49711 

0.03034 

0.48840 

2.77 

0.00861 

0.49720

0.02965 

0.48870 

2.78 

0.00837 

0.49728 

0.02898 

0.48899 

2.79 

0.00814 

0.49736 

0.02833 

0.48928 

2.80 

0.00792 

0.49744 

0.02768 

0.48956 

2.81 

0.00770 

0.49752 

0.02705 

0.48983 

2.82 

0.00748 

0.49760 

0.02643 

0.49010 

2.83 

0.00727 

0.49767 

0.02582 

0.49036 

2.84 

0.00707 

0.49774 

0.02522

0.49061 

2.85 

0.00687 

0.49781 

0.02463 

0.49086 

2.86 

0.00668 

0.49788 

0.02406 

0.49111 

2.87 

0.00649 

0.49795 

0.02349 

0.49134 

2.88 

0.00631 

0.49801 

0.02294 

0.49158 

2.89 

0.00613 

0.49807 

0.02239 

0.49180 

2.90 

0.00595 

0.49813 

0.02186 

0.49202 

2.91 

0.00578 

0.49819 

0.02134 

0.49224 

2.92 

0.00562 

0.49825 

0.02083 

0.49245 

2.93 

0.00545 

0.49831 

0.02033 

0.49266 

2.94 

0.00530 

0.49836 

0.01984 

0.49286 

2.95 

0.00514 

0.49841 

0.01936 

0.49305 

2.96 

0.00499 

0.49846 

0.01889 

0.49324 

2.97 

0.00485 

0.49851 

0.01842 

0.49343 

2.98 

0.00471 

0.49856 

0.01797 

0.49361 

2.99 

0.00457 

0.49861












TABLE OF THE NORMAL CURVE (Continued)


x/σ

z

A

x/σ

z

A

3.00 

0.00443 

0.49865 

3.50 

0.00087 

0.49977 

3.01 

0.00430 

0.49869 

3.51 

0.00084 

0.49978 

3.02 

0.00417 

0.49874 

3.52 

0.00081 

0.49978 

3.03 

0.00405 

0.49878 

3.53 

0.00079 

0.49979 

3.04

0.00393 

0.49882 

3.54 

0.00076 

0.49980 

3.05 

0.00381 

0.49886 

3.55 

0.00073 

0.49981 

3.06 

0.00370 

0.49889 

3.56 

0.00071 

0.49981 

3.07 

0.00358 

0.49893 

3.57 

0.00068 

0.49982 

3.08 

0.00348 

0.49897 

3.58 

0.00066 

0.49983 

3.09 

0.00337 

0.49900 

3.59 

0.00063 

0.49983 

3.10 

0.00327 

0.49903 

3.60 

0.00061 

0.49984 

3.11 

0.00317 

0.49906 

3.61 

0.00059 

0.49985 

3.12 

0.00307 

0.49910 

3.62 

0.00057 

0.49985 

3.13 

0.00298 

0.49913 

3.63 

0.00055 

0.49986 

3.14 

0.00288 

0.49916 

3.64 

0.00053 

0.49986 

3.15 

0.00279 

0.49918 

3.65 

0.00051 

0.49987 

3.16 

0.00271 

0.49921 

3.66 

0.00049 

0.49987 

3.17 

0.00262 

0.49924 

3.67 

0.00047 

0.49988 

3.18 

0.00254 

0.49926 

3.68 

0.00046 

0.49988 

3.19 

0.00246 

0.49929 

3.69 

0.00044 

0.49989 

3.20 

0.00238 

0.49931 

3.70 

0.00042 

0.49989 

3.21 

0.00231 

0.49934 

3.71 

0.00041 

0.49990 

3.22 

0.00224 

0.49936 

3.72 

0.00039 

0.49990 

3.23 

0.00216 

0.49938 

3.73 

0.00038 

0.49990 

3.24 

0.00210 

0.49940 

3.74 

0.00037 

0.49991 

3.25 

0.00203 

0.49942 

3.75 

0.00035 

0.49991 

3.26 

0.00196 

0.49944 

3.76 

0.00034 

0.49992 

3.27 

0.00190 

0.49946 

3.77 

0.00033 

0.49992 

3.28 

0.00184 

0.49948 

3.78 

0.00031 

0.49992 

3.29 

0.00178 

0.49950 

3.79 

0.00030 

0.49992 

3.30 

0.00172 

0.49952 

3.80 

0.00029 

0.49993 

3.31 

0.00167 

0.49953 

3.81 

0.00028 

0.49993 

3.32 

0.00161 

0.49955 

3.82 

0.00027 

0.49993 

3.33 

0.00156 

0.49957 

3.83 

0.00026 

0.49994 

3.34 

0.00151 

0.49958 

3.84 

0.00025 

0.49994 

3.35 

0.00146 

0.49960 

3.85 

0.00024 

0.49994 

3.36 

0.00141 

0.49961 

3.86 

0.00023

0.49994 

3.37 

0.00136 

0.49962 

3.87 

0.00022 

0.49995 

3.38 

0.00132 

0.49964 

3.88* 

0.00021 

0.49995 

3.39 

0.00127 

0.49965 

3.90 

0.00020 

0.49995 

3.40 

0.00123 

0.49966 

3.91 

0.00019 

0.49995 

3.41 

0.00119 

0.49968 

3.92 

0.00018 

0.49996 

3.42 

0.00115 

0.49969 

3.94 

0.00017 

0.49996 

3.43 

0.00111 

0.49970 

3.95 

0.00016 

0.49996 

3.44 

0.00107 

0.49971 

3.97 

0.00015 

0.49996 

3.45 

0.00104 

0.49972 

3.98 

0.00014 

0.49997 

3.46 

0.00100 

0.49973 

4.00 

0.00013 

0.49997 

3.47 

0.00097 

0.49974 

4.02 

0.00012 

0.49997 

3.48 

0.00094 

0.49975 

4.04 

0.00011 

0.49997 

3.49 

0.00090 

0.49976 

4.06 

0.00011 

0.49998


* For skipped x/σ items below, read values next preceding.
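The tabled quantities can be checked numerically. The following is a minimal sketch, assuming that z denotes the ordinate of the unit normal curve at x/σ and that A denotes the area between the mean and x/σ; the function names are illustrative and do not come from the text.

```python
import math

def ordinate(x):
    """Height z of the unit normal curve at x (x is in sigma units)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def area_from_mean(x):
    """Area A under the unit normal curve between the mean and x."""
    return 0.5 * math.erf(x / math.sqrt(2.0))

# Print a few rows in the same layout as the table: x/sigma, z, A.
for x in (0.50, 1.00, 1.96, 3.00):
    print(f"{x:4.2f}  z = {ordinate(x):.5f}  A = {area_from_mean(x):.5f}")
```

Rounded to five places, the row for 1.00 should agree with the tabled entries 0.24197 and 0.34134.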












APPENDIX C 


Sample Examination Questions in Statistics 
for Use as a Review Examination at the Beginning 
of the Course in Test Theory 1 


The following two experiments were performed: 

Experiment 1. The average of the men on a physical sciences test is 243.0. The 
average of the women is 226.5. The standard error of the difference 
is 5.0. 

Experiment 2. The average of the men on an English test is 158.4. The average of 
the women is 182.4. The standard error of this difference is 16.0. 

Mark the following statements according to this code: 

1. Applicable to experiment 1 

2. Applicable to experiment 2 

0. Applicable to neither experiment 1 nor to experiment 2 

____ It would be worth while repeating this experiment with twice as many cases in
each group.

____ It would be worth while repeating this experiment with four times as many
cases in each group.

____ Since chance variation will not explain the results of this experiment, it is
plausible to assume that there is a sex difference in the ability involved in this
test.

____ Since chance variation will explain the results of this experiment, I do not feel
that it is worth while to investigate this problem any further.

____ Differences larger than those obtained in this experiment would occur only one
time out of a thousand if the two groups differed only by chance.

____ There is only one chance out of a thousand that the difference between the two
groups was due to the influence of chance.

____ The difference of means is significant.

____ The difference of means is not significant.
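Items of this kind turn on the ratio of a difference to its standard error. A minimal sketch follows, assuming a normal sampling distribution and a two-tailed reading of "chance"; the function names are illustrative only.

```python
import math

def critical_ratio(mean_a, mean_b, se_diff):
    """Difference between two means expressed in standard-error units."""
    return (mean_a - mean_b) / se_diff

def two_tailed_p(cr):
    """Probability of a unit normal deviate at least as large as |cr|."""
    return math.erfc(abs(cr) / math.sqrt(2.0))

cr1 = critical_ratio(243.0, 226.5, 5.0)    # experiment 1: men minus women
cr2 = critical_ratio(158.4, 182.4, 16.0)   # experiment 2: men minus women

print(cr1, two_tailed_p(cr1))
print(cr2, two_tailed_p(cr2))
```

Because a standard error of a mean varies inversely with the square root of the number of cases, halving it requires about four times as many cases, not twice as many.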

1 If students are not required to memorize formulas, items such as these are suitable
for “open-book” examinations.

An intelligence test and an arithmetic test are given to a group of 1000 students.
The correlation coefficient, means, standard deviations, and parameters of both
regression lines are computed. In rechecking the results it is found that the intelligence
test has been scored accurately, but there were a number of errors in the scoring of
the arithmetic test. Assume that these errors are completely random errors.

For each of the measures listed below write: 

1. If the correct value is larger than the value already computed. 

2. If the correct value is smaller than the value already computed. 

3. If the two values are the same.

____ The standard error of the mean of the arithmetic scores.

____ The standard deviation of the distribution of intelligence scores.

____ The mean arithmetic test score.

____ The standard deviation of the errors made in predicting arithmetic test score
from intelligence test score.

____ The coefficient of alienation.

____ The coefficient of correlation between intelligence and arithmetic.

____ The variance of the predicted arithmetic test scores (that is, predicted from the
regression of arithmetic test score on intelligence test score).

____ The variance of the observed arithmetic test scores minus the variance of the
predicted arithmetic test scores.

____ The ratio of the standard deviation of the predicted intelligence test scores
(that is, predicted from the arithmetic score) to the standard deviation of the
observed intelligence test scores.

____ The square of the alienation coefficient plus the square of the correlation
coefficient.

____ The product of the alienation coefficient and the standard deviation of the
distribution of observed scores.

____ The slope of the regression of arithmetic on intelligence.

____ The standard deviation of the arithmetic test.

____ The slope of the regression of intelligence on arithmetic.
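The effect of purely random scoring errors on these statistics can be sketched numerically. The example below assumes the errors have zero mean and are uncorrelated with both tests, so that they add to the variance of the arithmetic scores but leave the covariance unchanged. All numerical values and names are hypothetical.

```python
import math

def with_random_error(sd_x, sd_y, r_xy, sd_error):
    """Statistics for x and y after independent error of SD sd_error is added to y.

    Returns (sd_y_observed, r_observed, slope_y_on_x, slope_x_on_y).
    """
    sd_y_obs = math.sqrt(sd_y**2 + sd_error**2)   # error inflates the variance of y
    r_obs = r_xy * sd_y / sd_y_obs                # covariance is unchanged
    slope_y_on_x = r_obs * sd_y_obs / sd_x
    slope_x_on_y = r_obs * sd_x / sd_y_obs
    return sd_y_obs, r_obs, slope_y_on_x, slope_x_on_y

# Hypothetical values: intelligence SD 15, arithmetic SD 10, correlation .60.
clean = with_random_error(15.0, 10.0, 0.60, sd_error=0.0)
noisy = with_random_error(15.0, 10.0, 0.60, sd_error=5.0)
print(clean)
print(noisy)
```

Under these assumptions the regression of the error-laden test on the accurately scored test keeps its slope, while the correlation and the regression in the other direction are both reduced.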
Before each of the following items write the number of the one formula from the
following list of six that is most directly connected with the problem to be solved.
Be sure not to make any calculations, just indicate the one best formula in each case.

1. $M \pm 2 s_M$

2. $\dfrac{M_1 - M_2}{s_d}$, where $s_d = \sqrt{s_1^2/N_1 + s_2^2/N_2}$

3. $\dfrac{M_1 - M_2}{s_d}$, where $s_d = \sqrt{(1/N)(s_1^2 + s_2^2 - 2 r_{12} s_1 s_2)}$

4. $k s_y$

5. $\dfrac{\Sigma xy}{N s_x s_y}$

6. $r \dfrac{s_y}{s_x} X + M_y - r \dfrac{s_y}{s_x} M_x$

____ How can I estimate the geometry score of a student from his performance in
algebra?

____ How far wrong is one likely to be when using arm-length to estimate height?

____ A sample of 100 Wistar adult white rats has an average weight of 342.5 grams;
the standard deviation of the distribution of weight is 9.3 grams. What are
reasonable upper and lower limits for the average weight of all Wistar adult
white rats?

____ Which of two aptitude tests would it be better to use for estimating grades in
this college?

____ An experiment is performed using two persons (one brother and one sister) from
each of a hundred families. An intelligence test is given to these two hundred
persons. Do brothers score higher than their sisters?

____ An instructor has two classes. In one there are 150 students, and in the other
there are 136 students. The same intelligence test is given to the entire group
of 286 students. Is the average intelligence of one class clearly higher than
that of the other?

____ I want to predict the speed with which a rat will learn maze B from its
performance in maze A.

____ Are rats more active on days when they have thyroid extract than on control
days when they do not get the extract?
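Several of these items rest on a standard error of a mean or of a difference between means. The sketch below uses the usual large-sample formulas; the multiplier of 2 for "reasonable limits" is an assumption rather than a value confirmed by the text, and the function names are illustrative.

```python
import math

def se_mean(sd, n):
    """Standard error of a mean based on n cases."""
    return sd / math.sqrt(n)

def se_diff_independent(sd1, n1, sd2, n2):
    """Standard error of a difference between means of two independent groups."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

def se_diff_paired(sd1, sd2, r12, n):
    """Standard error of a difference between means of n matched pairs."""
    return math.sqrt((sd1**2 + sd2**2 - 2.0 * r12 * sd1 * sd2) / n)

# The rat-weight item: N = 100, mean 342.5 grams, SD 9.3 grams.
se = se_mean(9.3, 100)
lower, upper = 342.5 - 2.0 * se, 342.5 + 2.0 * se   # assumed multiplier of 2
print(se, lower, upper)
```

The matched-pairs form applies when the same families or the same animals supply both sets of scores; a positive correlation between the paired scores then reduces the standard error of the difference.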

Before each statement given below, put a circle around the number(s) indicating
the assumption(s) which must be made if the statement is to be regarded as correct.
Use the following code:

1. A zero point which is not arbitrary.

2. A constant unit of measurement. 

3. The assumed mean is approximately equal to the true mean. 

4. The cases are evenly distributed within the class interval. 

5. The number of cases in a class interval varies inversely with the distance from 
the mean. 

6. The two distributions have the same number of cases. 

7. The two distributions have the same mean. 

8. The statement is correct as it stands, no assumption being involved. 

1 2 3 4 5 6 7 8 The differences between brothers can be measured by taking the
differences of their test scores.

1 2 3 4 5 6 7 8 The differences between brothers can be measured by taking the
ratios of their test scores.

1 2 3 4 5 6 7 8 The mean may be computed by grouping the data in class
intervals.

1 2 3 4 5 6 7 8 The simplest method of calculating the mean is by using grouped
data and an equivalent scale with an assumed mean and an
arbitrary origin.

1 2 3 4 5 6 7 8 The standard deviation of a distribution may be calculated by
using grouped data and an equivalent scale with an assumed
mean and an arbitrary origin.

1 2 3 4 5 6 7 8 The mean may be calculated from the formula M = ΣX/N.

1 2 3 4 5 6 7 8 The median may be calculated from a frequency distribution
plotted in class intervals of ten.

1 2 3 4 5 6 7 8 A class of students is divided into two sections for the purpose of
administering a given test. The standard deviation of the total
class can be calculated from the means, standard deviations, and
number of cases for each section.

1 2 3 4 5 6 7 8 In the case just mentioned, the mean of the total class can be
calculated if one knows the mean and number of cases for each
section.

1 2 3 4 5 6 7 8 John is twice as intelligent as James.

1 2 3 4 5 6 7 8 If a class in geometry has been given three tests during the
semester, the final ranking of the students can be determined
by summing these three scores for each student.

All applicants for admission to the university are given an English examination 
and a scholastic aptitude test. In 1933 the standards for admission based entirely 
on scholastic aptitude scores were raised. As a result the size of the entering class 
decreased markedly. 

Mark the following items: 

1. If it will probably be larger in the freshman class of 1932. 

2. If it will probably be larger in the freshman class of 1933. 

3. If it will probably be about the same in both classes. 

4. If not enough data are given, or if code numbers given above do not apply. 

____ Mean score on the scholastic aptitude test.

____ Standard deviation of the English examination.

____ Pearson correlation coefficient between English and aptitude scores.

____ Variance of the English scores as estimated from the aptitude test scores.

____ The standard deviation of the errors of prediction of English from aptitude test
scores.

____ Coefficient of alienation.

____ Slope of the line of regression of English on aptitude scores.

____ Slope of the line of regression of aptitude on English scores.

____ The ratio of the standard error of estimate (that is, error made in estimating
aptitude scores from English scores) to the standard deviation of the aptitude
scores.

____ Standard deviation of aptitude scores.
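The effect of raising an admission standard can be sketched with the usual formulas for explicit selection on one variable. The example assumes linear regression and equal variability of English scores about the regression line at every aptitude level; the starting correlation, the standard deviations, and the function names are hypothetical.

```python
import math

def restricted_r(r_full, sd_ratio):
    """Correlation after selection on one variable.

    sd_ratio is the restricted SD of the selection variable divided by its
    unrestricted SD (a value below 1 when the group becomes more homogeneous).
    """
    return r_full * sd_ratio / math.sqrt(1.0 - r_full**2 + (r_full * sd_ratio)**2)

def restricted_sd_other(sd_other, r_full, sd_ratio):
    """SD of the other variable (here English) in the selected group."""
    return sd_other * math.sqrt(1.0 - r_full**2 + (r_full * sd_ratio)**2)

# Hypothetical case: r = .60 before the change, aptitude SD cut in half.
r_after = restricted_r(0.60, 0.5)
sd_english_after = restricted_sd_other(10.0, 0.60, 0.5)
print(r_after, sd_english_after)
```

Under these assumptions the slope of English on aptitude and the error of predicting English from aptitude are unchanged by the selection, while the correlation and both standard deviations are reduced.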

For each of the statements below write: 

1. If it applies to the mode. 

2. If it applies to the median. 

3. If it applies to the mean. 

4. If it applies to none of these terms. 

____ The abscissa of the highest point on the frequency distribution.

____ The ordinate of the highest point on the cumulative frequency curve.

____ The x-value of the steepest part of the cumulative frequency curve.

____ The point halfway between the two extreme values of the distribution.

____ The score value so chosen that exactly 50 per cent of the scores are higher
than it.

____ The measure which lends itself most readily to algebraic treatment.

Calculate the mean and standard deviation of the following distribution: 


Score      Frequency

160-174       10
145-159       22
130-144       45
115-129       18
100-114        5


Present all your calculations in an orderly data sheet. 
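A calculation of this kind can be checked by machine. The sketch below assumes that every case in a class interval is placed at the interval midpoint and that the standard deviation uses N, not N - 1, as its divisor; the function name is illustrative.

```python
import math

# (lower limit, upper limit, frequency) for each class interval of the item.
intervals = [
    (160, 174, 10),
    (145, 159, 22),
    (130, 144, 45),
    (115, 129, 18),
    (100, 114, 5),
]

def grouped_mean_sd(rows):
    """Mean and standard deviation from grouped data, using interval midpoints."""
    n = sum(freq for _, _, freq in rows)
    mids = [((low + high) / 2.0, freq) for low, high, freq in rows]
    mean = sum(mid * freq for mid, freq in mids) / n
    variance = sum(freq * (mid - mean) ** 2 for mid, freq in mids) / n
    return n, mean, math.sqrt(variance)

n, mean, sd = grouped_mean_sd(intervals)
print(n, round(mean, 2), round(sd, 2))
```

The same result follows from the coded method with an assumed mean of 137 and an interval width of 15, which is the form an orderly hand data sheet would take.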

Suppose that after you have computed a mean, mode, median, range, and standard 
deviation on the data shown in the above tabulation, you find that two scores that 
are tabulated as 146 in the distribution are erroneous and should be tabulated as 129. 
Do not calculate means or standard deviations; answer simply from the general trend of 
the data. 

The value of the mean already computed is (the same as, larger than, smaller than) 
the correct value. 

The value of the mode already computed is (the same as, larger than, smaller than) 
the correct value. 

The value of the median already computed is (the same as, larger than, smaller 
than) the correct value. 

The value of the standard deviation already computed is (the same as, larger than, 
smaller than) the correct value. 

The value of the range already computed is (the same as, larger than, smaller than) 
the correct value. 

Given originally a symmetrical distribution of 500 cases with a mean of 100 and
a σ of 25.

Add to this a second symmetrical distribution of 100 cases with individual scores 
ranging from 110 to 126. 

Before each of the measures listed below write: 

1. If the measure is increased by adding the second distribution. 

2. If the measure is decreased by adding the second distribution. 

3. If the measure is not affected by adding the second distribution. 

4. If it is impossible to tell what will happen from the information given. 

____ Mean

____ Median

____ Range

____ Mode

____ Standard deviation

____ Average deviation
Below are shown curves for six distributions, some of which are cumulative frequency
curves and others are column diagrams.



Encircle the correct alternative, or alternatives, in each of the parentheses. Be
sure to distinguish carefully between distribution, which is the general term, and
column diagram and cumulative frequency curve, which refer to specific types of
plots.


The column diagram with the smallest standard deviation. (A, B, C, D, E, F, None)

The column diagram with the smallest mode. (A, B, C, D, E, F, None)

The column diagram with the largest mean. (A, B, C, D, E, F, None)

The column diagram with the smallest range. (A, B, C, D, E, F, None)

The column diagram with the smallest N. (A, B, C, D, E, F, None)

The cumulative frequency curve with the smallest standard
deviation. (A, B, C, D, E, F, None)

The cumulative frequency curve with the largest mean. (A, B, C, D, E, F, None)

The cumulative frequency curve with the largest mode. (A, B, C, D, E, F, None)

The distribution with the largest range. (A, B, C, D, E, F, None)

The distribution with the smallest range. (A, B, C, D, E, F, None)

The distribution with the largest median. (A, B, C, D, E, F, None)

The distribution(s) which is (are) negatively skewed and
unimodal. (A, B, C, D, E, F, None)

The distribution(s) which is (are) positively skewed and
unimodal. (A, B, C, D, E, F, None)

The distribution(s) which is (are) bimodal. (A, B, C, D, E, F, None)

Fill in the spaces in the following table:* 

The relationships between the three terms M, N, and ΣX are such that it is possible
to find the value of one of them when the other two are given. Below are a series of
such problems.

Fill in the blank spaces.

  M      N      ΣX

  4      6     ____
 35     ____    140
____     7      84
  8     ____    104
____     4      100
  7      9     ____

If the mean of a given set of scores (X) is 12 and the number of cases is
15, find Σ(X - k), where k is 5.

Ans. Σ(X - k) = ____

If ΣX is 133 and ΣY is 95, the mean of the X-scores is 7 and the mean of the
Y-scores is 5. What is the value of Σ(X - Y)?

Ans. Σ(X - Y) = ____

If 25 students took both a vocabulary test and an intelligence test, and the follow¬ 
ing averages were found: 

Vocabulary test average = 56; intelligence test average = 51; then

The sum of all the scores in the vocabulary test is_ 

The sum of all the scores in the intelligence test is_ 

The sum of all the scores in both tests is_ 

If each student is given a composite score, which is found by taking his vocabulary 
test score and adding to it the intelligence test score multiplied by 2, the average of 
this set of 25 composite scores will be 


If each student is given a composite score, which is found by subtracting his intel¬ 
ligence test score from his vocabulary test score, the average of these composite 
scores will be 


If a new vocabulary score is found by deducting 10 from each student’s original 
vocabulary test score, the sum of these new scores will be 

And the average of these new scores will be 
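Questions of this type rest on the fact that the mean of a weighted sum of scores is the same weighted sum of the means. A minimal sketch, using the averages stated in the item; the function name and its parameters are illustrative.

```python
def mean_of_composite(mean_x, mean_y, weight_x=1.0, weight_y=1.0, constant=0.0):
    """Mean of the composite weight_x*X + weight_y*Y + constant."""
    return weight_x * mean_x + weight_y * mean_y + constant

n = 25
mean_vocab, mean_intel = 56.0, 51.0

sum_vocab = n * mean_vocab            # a sum of scores is N times their mean
sum_intel = n * mean_intel
weighted = mean_of_composite(mean_vocab, mean_intel, 1.0, 2.0)
difference = mean_of_composite(mean_vocab, mean_intel, 1.0, -1.0)
shifted = mean_of_composite(mean_vocab, 0.0, 1.0, 0.0, constant=-10.0)

print(sum_vocab, sum_intel, sum_vocab + sum_intel)
print(weighted, difference, shifted, n * shifted)
```

No individual scores are needed: the means and the number of cases determine every quantity asked for.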
In the following problems c and k represent constants; also M (the mean) and N
(the number of cases) are constants. X and Y are variables; Σ indicates “the sum of.”
Simplify each of the following expressions.


$\sum_{1}^{N} MX =$                     $\sum_{1}^{N} (NY + c - XM + X^2) =$

$\sum_{1}^{N} (kX + cY) =$              $\sum_{1}^{N} (Y - c) =$

$\sum_{1}^{N} (YN) =$                   $\sum_{1}^{N} (k + Y) =$

$\sum_{1}^{N} (2XM) =$                  $\sum_{1}^{N} [(X + Y)(X - Y)] =$

$\sum_{1}^{N} (2M^2) =$                 $\sum_{1}^{N} [(X + Y)Y] =$

$\sum_{1}^{N} (NM^2) =$                 $\sum_{1}^{N} [(X + Y)^2] =$

$\sum_{1}^{N} (M^2 - kX) =$             $\sum_{1}^{N} [(X - Y)(Y + kX)] =$

$\sum_{1}^{N} (MX + Y^2 + cY) =$


If U, V, W, and Z are variables and a, b, c, and d are constants, simplify the following
expressions:

$\Sigma abU =$
$\Sigma VaWc =$
$\Sigma (U + c)(Z - d) =$
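Simplifications of this kind follow from three rules: a constant factor may be moved outside the summation sign, the sum of a constant over N cases is N times the constant, and the sum of a sum is the sum of the separate sums. The sketch below checks a few instances on arbitrary illustrative data; the data values and the function name are hypothetical.

```python
import math

def summation_rule_checks(X, Y, k, c):
    """Return (left side, right side) pairs for a few summation rules."""
    N = len(X)
    M = sum(X) / N  # the mean of X, a constant under the summation sign
    return {
        "sum(kX + cY) = k*sum(X) + c*sum(Y)": (
            sum(k * x + c * y for x, y in zip(X, Y)),
            k * sum(X) + c * sum(Y),
        ),
        "sum(Y - c) = sum(Y) - N*c": (
            sum(y - c for y in Y),
            sum(Y) - N * c,
        ),
        "sum(M*X) = N*M^2": (
            sum(M * x for x in X),
            N * M**2,
        ),
        "sum((X+Y)(X-Y)) = sum(X^2) - sum(Y^2)": (
            sum((x + y) * (x - y) for x, y in zip(X, Y)),
            sum(x * x for x in X) - sum(y * y for y in Y),
        ),
    }

checks = summation_rule_checks([3.0, 5.0, 8.0, 12.0], [1.0, 4.0, 6.0, 9.0], k=2.0, c=5.0)
for rule, (left, right) in checks.items():
    print(rule, math.isclose(left, right))
```

A numerical check of this sort confirms a proposed simplification but does not replace the algebra, since agreement on one data set is not a proof.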

In a positively skewed distribution: 

The is generally (larger than, smaller than, the same as) the median. 

The mode is generally (larger than, smaller than, the same as) the mean. 

The is generally (larger than, smaller than, the same as) the median. 
Each of the following cases shows a discrepancy between the mean as calculated
from the formula M = ΣX/N and the mean calculated by grouped data and an
arbitrary origin.

Before each of the following cases place the number of the one comment that best 
applies. 

Comments: 

1. There must be an error in computation or tabulation. 

2. If the arbitrary origin were placed nearer to the mean of the distribution, the 
discrepancy between the two means would be partially corrected. 

3. Such a discrepancy between the two means is reasonable. 

4. There is either some error in computation or tabulation, or else poor judgment 
has been shown in choosing the limits for the class intervals. 

5. Such a discrepancy between the two means is reasonable when one has such a 
large standard deviation. 



































APPENDIX D 

Sample Examination Items in Test Theory 1 


After each of the following statements, encircle the letter or letters of all the
statements which apply. Use the following code.

O = a statement that could not reasonably be true.

T = a statement that is unconditionally true.

A = true if the mean error is assumed to be zero.

B = true if the correlation between errors and true score is zero.

C = true if the correlation between two sets of errors is zero.

D = true if the standard deviations of two sets of errors are the same.


The observed score is equal to the true score plus the error. O T A B C D

Equivalent forms of a test will have the same standard
deviation. O T A B C D

The average true score is equal to the average observed score. O T A B C D

The true variance is equal to the error variance plus the
observed variance. O T A B C D

The error variance is equal to the observed variance multiplied
by the difference between unity and the reliability
coefficient. O T A B C D

The average error is equal to the sum of the errors divided by
the number of errors. O T A B C D

The true variance is equal to the reliability coefficient multiplied
by the observed variance. O T A B C D

The correlation between true scores and observed scores is
equal to the square of the reliability coefficient. O T A B C D

The square root of the difference between unity and the
reliability coefficient is equal to the correlation between
observed scores and error scores. O T A B C D

The observed variance less the true variance is equal to the
error variance. O T A B C D


1 If students are not required to memorize formulas, items such as these are suitable
for “open-book” examinations.

Miscellaneous formulas applicable to a single test: 

$r$ = reliability of the test.
$r_r$ = reader reliability of the test.
$e$ = the standard error of measurement.
$\sigma$ = the true standard deviation.
$s$ = the standard deviation of the test scores.
$d$ = the standard deviation of the difference between scores on comparable halves
of the test.

1. $\sigma^2 / s^2$

2. $e / s$

3. $\sigma / s$

4. $\sqrt{r}$


For each of the following items write the number of the one or more formulas that 
are clearly indicated. Be sure to give all the answers that are correct. 

_The reliability coefficient. 

_The correlation between true scores and observed scores. 

_The correlation between errors and observed scores. 

_The correlation between comparable halves of a test. 

_The total test variance. 

_The content reliability of the test. 

_The true variance divided by the reliability coefficient. 

_The Spearman-Brown formula would need to be used on this quantity in order 

to get the reliability of the test. 

-This will decrease with an increase in the length of the test. 

_The index of reliability. 

Given the following information from the manual on each of two standardized
spelling tests:


           Mean    Standard Deviation    Reliability

Test A     100            20                 .81
Test B     200            40                 .95


5. $\sqrt{1 - r}$

6. $\sigma^2 + e^2$

7. $r / r_r$

8 * + <P 


Estimate the standard error of measurement of test A. 
Estimate the standard error of measurement of test B. 
Estimate the correlation between true scores and observed scores for
test A. ____

Student X scores 100 in test A. What is the standard error of this
score? ____

What is a reasonable upper limit for the true score of student X? ____

What is a reasonable lower limit for the true score of student X? ____

Student Y scores 300 in test B. What is the standard error of this
score? ____

What is a reasonable upper limit for the true score of student Y? ____

What is a reasonable lower limit for the true score of student Y? ____

What is the standard deviation of the true scores in test A? ____

What is the standard deviation of the true scores in test B? ____

What is the mean of the true scores in test A? ____

What is the mean of the error scores in test B? ____

Give the index of reliability for test B. ____

What is the correlation between true scores and error scores for test A? ____

What is the correlation between observed scores and errors for test B? ____

Student Z receives a standard score of 1.5 in test A. What is his gross
score? ____

What is the standard error of the standard score of 1.5? ____

What is the probable upper limit for the true standard score of
student Z? ____

What is the probable lower limit for the true standard score of
student Z? ____

Test A is selected by you and given to your class with the following
results, M = 100; σ = 40. Comment on these results.


Test B is selected by you and given to your class with the following results, M = 250;
σ = 30. Comment on these results.
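The quantities asked for follow from the manual values by the standard relations of test theory. The sketch below assumes the mean error is zero and errors are uncorrelated with true scores; the multiplier of 2 used for "reasonable limits" is an assumption, and the function names are illustrative.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Error SD: observed SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def true_sd(sd, reliability):
    """True-score SD: observed SD times the square root of the reliability."""
    return sd * math.sqrt(reliability)

def score_band(score, sd, reliability, multiplier=2.0):
    """Limits of score plus and minus multiplier times the error of measurement."""
    sem = standard_error_of_measurement(sd, reliability)
    return score - multiplier * sem, score + multiplier * sem

# Manual values from the item: (mean, standard deviation, reliability).
manual = {"A": (100.0, 20.0, 0.81), "B": (200.0, 40.0, 0.95)}

for name, (mean, sd, rel) in manual.items():
    print(name,
          round(standard_error_of_measurement(sd, rel), 2),
          round(true_sd(sd, rel), 2),
          round(math.sqrt(rel), 3))   # index of reliability

print(score_band(100.0, 20.0, 0.81))  # student X on test A
```

Although test B is much the more reliable, the two tests have nearly the same error of measurement in raw-score units, because test B also has twice the standard deviation.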
An arithmetic test is reported to have a standard error of measurement
of 10.

Estimate the reliability of this test when it is given to a class with 
mean 100 and standard deviation 20. 

Estimate the reliability of this test when it is given to a class with 
mean 200 and standard deviation 10. 

Estimate the reliability of this test when it is given to a class with 
mean 150 and standard deviation 40. 
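If the standard error of measurement is taken to be the same in every group, the reliability in a given group depends only on that group's standard deviation, not on its mean. A minimal sketch under that assumption; the function name is illustrative.

```python
def reliability_from_sem(sem, sd):
    """Reliability implied by an error of measurement and a group SD."""
    return 1.0 - (sem / sd) ** 2

# Group standard deviations from the three cases in the item; error of measurement 10.
for sd in (20.0, 10.0, 40.0):
    print(sd, reliability_from_sem(10.0, sd))
```

A group whose standard deviation is no larger than the error of measurement yields a reliability of zero, which indicates that the test does not discriminate within so homogeneous a group.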


Various standard error formulas, together with erroneous formulas: 


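$r_{11}$ = reliability coefficient of test 1.

$r_{12}$ = the correlation between tests 1 and 2, which are not necessarily different
forms of the same test.

$\sigma_1$ = the standard deviation of the test scores for test 1.


1. $\sigma_1 \sqrt{1 - r_{11}}$

2. $\sigma_1 \sqrt{1 - r_{11}^2}$

3. $\sigma_1 \sqrt{1 - r_{12}}$

4. $\sigma_1 \sqrt{1 - r_{12}^2}$

5. $\sigma_1 \sqrt{2} \sqrt{1 - r_{11}}$

6. $\sigma_1 \sqrt{2} \sqrt{1 - r_{11}^2}$

7. $\sigma_1 \sqrt{2} \sqrt{1 - r_{12}}$

8. $\sigma_1 \sqrt{2} \sqrt{1 - r_{12}^2}$

9. $\sigma_1 \sqrt{r_{11}} \sqrt{1 - r_{11}}$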


For each of the following items write the number of the one formula which is most
clearly suggested.

____ By how much does this student’s test score deviate from his true score?

____ What is the extent of the error I am likely to make in estimating college grades
by a scholastic aptitude test?

____ Mr. A is using one form of the Otis test, Mr. B is using a parallel form of the
same test. By how much are their scores for the same people likely to differ?

____ I have given ten different forms of this test to Mr. X. Can I estimate the
standard deviation of this distribution of ten scores?

____ The formula for the standard error of measurement.

____ The formula for the standard error of substitution.

____ The formula for the standard error of estimate.

____ The standard deviation of the distribution of differences between scores on
parallel forms of a test.

____ The smallest standard error in the group.

____ The standard deviation of the errors made in regarding the obtained score as
the true score.

____ The error made in predicting true score from the fallible scores.
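The several standard errors named in these items can be compared numerically. The sketch below uses the customary definitions, with the symbols of the formula list; the numerical values are hypothetical and the function names are illustrative.

```python
import math

def se_measurement(sd, r11):
    """Error made in regarding an obtained score as the true score."""
    return sd * math.sqrt(1.0 - r11)

def se_substitution(sd, r11):
    """SD of differences between scores on two parallel forms."""
    return sd * math.sqrt(2.0) * math.sqrt(1.0 - r11)

def se_estimate(sd, r12):
    """Error made in predicting test 1 from a different test 2."""
    return sd * math.sqrt(1.0 - r12**2)

def se_true_from_observed(sd, r11):
    """Error made in predicting true score from the fallible score."""
    return sd * math.sqrt(r11) * math.sqrt(1.0 - r11)

sd, r11 = 20.0, 0.81   # hypothetical test
errors = {
    "measurement": se_measurement(sd, r11),
    "substitution": se_substitution(sd, r11),
    "true from observed": se_true_from_observed(sd, r11),
}
for name, value in errors.items():
    print(name, round(value, 3))
```

For any reliability between zero and one, the error of predicting true score is the smallest of the three and the error of substitution is the largest.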

Indicate the type of scores to which each of the following statements refers by 

using the following code. Give the one best answer for each item. 

1. Raw score. 5. Absolute scores. 

2. Standard score. 6. Mental age scores. 

3. Percentile scores. 7. I.Q. scores. 

4. Normalized scores. 8. None of the foregoing scores. 

-Gives a linear plot with chronological age for ages 2 to 10 years, if the average 

score for large groups is used. 

-This distribution must be Gaussian. 

_The frequency distribution of these scores is rectangular. 

_These scores are linearly related to raw scores. 

_The origin of these scores is at zero ability. 

_The unit of measure is in some testable sense constant at different points on 

the scale. 

_In groups that are homogeneous with respect to attainment level, this score is 

likely to be correlated negatively with chronological age. 

_There is a procedure for checking to see whether or not these scores may be 

applied to a given set of data. 

_If the raw score distribution is Gaussian, the plot of these scores against standard 
scores will be linear. 

_The result of the plot of these scores against normalized scores is the integral 

of the normal probability curve. 

_These scores are comparable from distribution to distribution, in the sense 

that different groups have the same mean and standard deviation. 

_These scores assume that all differences in the frequency distributions for different 
groups are due solely to the peculiarities of the test. 

_These scores assume that after all rank order is the important thing to consider. 
Derive the equation for the correlation between the observed test score and the true 
test score. 

As indicated below, be sure to distinguish between definitions, assumptions, and the 
derivation. 

Definitions used 

(15 lines) 

Assumptions used 

(5 lines) 

Derivation 

(15 lines) 

Final equation 

(1 line) 


Test x and test y are two different power tests of the same ability. 

Test x is a 400-item test, test y is a 1600-item test. 

Both these tests are given to group A (N = 2000), and to group B (N = 1000), 
with the following results. 

          Mean     σ     Group   Test 
Case 1     200     25      A      x 
Case 2     170     16      B      x 
Case 3     700     90      A      y 
Case 4     600     60      B      y 

A number of different statistics, such as reliability, validity, error of measurement, 
were computed for each of the four cases indicated above. 

In column I compare the statistics for the same test and different groups. 

Write A if it will probably be larger in group A. 

B if it will probably be larger in group B. 

S if it will probably be about the same (except for sampling errors) in 
both groups. 

O if the data given do not suggest which will be larger. 

N if the statement is nonsense. 

= if they must be identical for both groups. 

In column II compare the statistics for the same group and different tests. 

Write x if it will probably be larger for test x. 
y if it will probably be larger for test y. 

S if it will probably be about the same (except for sampling errors) in 
both tests. 

O if the data given do not suggest the relative magnitudes of the two 
quantities. 

N if the statement is nonsense. 

= if they must be identical for both tests. 
    I                II 

Same Tests,      Same Groups, 
Different        Different 
Groups           Tests 

- -- The reliability coefficient. 

- - The standard error of measurement. 

- - The true standard deviation. 

- - The validity coefficient, for example, correlation 
with college grades. 

- - Reliability for infinite length. 

- - The observed variance minus the error 

variance. 

- - The correlation between observed scores 

and error scores. 

- - The standard error of substitution. 

- - The correlation between true scores and 

error scores. 

- - The validity coefficient when corrected for 

attenuation. 


Five comparable forms of a test are administered to the same class. It is necessary 
to predict the correlation between the average of the first two and the average of the 
last three forms. 

Start with the definition of $r_{(x_1+x_2)(x_3+x_4+x_5)}$ and show that with certain 
conventional test theory assumptions, the required formula is 


$$\frac{r\sqrt{6}}{\sqrt{1 + 3r + 2r^2}},$$


where r is the reliability of a single form of the test. 
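
As a numerical check on the formula quoted above (not a substitute for the requested derivation), the following Python sketch compares it with the correlation of sums obtained directly from the variances and covariances. It assumes, in line with the usual parallel-forms assumptions, that all five forms have equal variances and a common intercorrelation r; the function names are illustrative only.

```python
import math

def corr_of_sums(r, m, n):
    """Correlation between the sum of m parallel forms and the sum of n other
    parallel forms, assuming unit variances and a common intercorrelation r."""
    cov = m * n * r                    # m*n cross-covariances, each equal to r
    var_m = m * (1 + (m - 1) * r)      # variance of the sum of m forms
    var_n = n * (1 + (n - 1) * r)      # variance of the sum of n forms
    return cov / math.sqrt(var_m * var_n)

def closed_form(r):
    """The formula stated in the problem for two forms versus three forms."""
    return r * math.sqrt(6) / math.sqrt(1 + 3 * r + 2 * r ** 2)

for r in (0.5, 0.8, 0.9):
    print(r, round(corr_of_sums(r, 2, 3), 4), round(closed_form(r), 4))
```

Averages and sums differ only by constant divisors, so the correlation between the two averages is the same as the correlation between the two sums.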


For each step indicate at the right the assumption made, the definition used, or 
the operation performed. 


Derivation 


Assumption, Definition, 
or Operation 
A test of arithmetic ability is given to each of three classes. In class A the testing 
conditions are excellent. In class B the testing conditions are about the same as in 
class A, except that an oversight of the tester allows the class two minutes more 
than given to class A to work the test. In class C the testing conditions are not uniform; 
after the test was over it was found that about one-third of the class had misunderstood 
the nature of the test and had answered the questions with a bias that 
influenced the correctness of the responses. The following were the results obtained: 



           Mean   S.D.   Average Number   Average Number    N 
                         of Items         of Errors 
                         Attempted 

Class A     50     5          100                          100 
Class B     55     6          120                          124 
Class C     45                 90                           81 


No person in any of the three classes finished the test. There were practically no 

items skipped. 

For each of the questions below write the number of the item that best applies. 

Use the following code: 

1. Class A. 5. False. 

2. Class B. 6. Can’t tell from data given. 

3. Class C. 7. Nonsense. 

4. True. 

_Would the reliability of the test be greater as calculated on class B or on class A? 

_In which class would the test reliability be greatest? 

_Would the test reliability be greater when calculated on class A or on class C? 

_Which class was best on the ability tested? 

_The reliability of the test calculated on the combined scores of classes A and C 

would be greater than that calculated on the combined scores of classes A 
and B. 

_The Kuder-Richardson formula for reliability can legitimately be applied to 

the results of class A. 

_The Kuder-Richardson formula could legitimately be applied to class B. 

_The Kuder-Richardson formula could legitimately be applied to class C. 

_If an intelligence test were given to the three classes, which class would most 

likely have the lowest mean score? 

_The correlation between the scores in class A and class B would probably be 

higher than that for classes A and C. 













Miscellaneous formulas: 
$r_{11}$ = a reliability coefficient. 
ri 2 ** a validity coefficient. 
$\sigma$ = the true standard deviation of the test. 

$s$ = the standard deviation of the test scores. 
$e$ = the standard error of measurement. 
$n$ = the number of times a test is increased in length. 
1 = subscript designating the test. 

0 = subscript designating the criterion. 


1 . 

2 . 


3. 




4. $\sigma n$

5. $e\sqrt{n}$

6. $ns^2(nr + 1 - r)$

7. $s\sqrt{n}\,\sqrt{nr + 1 - r}$

8. $\sigma^2 n^2$

9. $e^2 n$


For each of the following items write the number of the one formula that is most 

clearly suggested. 

-The true correlation between test and criterion. 

-The error variance of a test when it is increased in length. 

-The standard deviation of the raw scores of the augmented test. 

-If I quadruple the length of this test and give it again to the same group of 

students, what will happen to the true variance? 

_I should like to know how much this test would correlate with my criterion, if 

it were possible to measure the criterion with a reliability of unity. 

_Can I estimate the correlation that would exist between college grades and 

intelligence, if it were not for the errors of measurement in both variables? 

_This test has a standard deviation of 20 and a reliability of .80. What will the 

standard deviation probably be if I quadruple the length of the test? 

Give the numerical answer to the foregoing question here-- 

_This aptitude test has a reliability coefficient of .81, a validity coefficient of 

.64, a standard deviation of 30, and an error of measurement of 13+. I should 
like to estimate the validity coefficient the test would have if it were made perfect, 
and correlated with the same criterion scores as before. 

Give the numerical answer to the foregoing question here--— 

_ Does the error of measurement of a test increase, diminish, or remain constant 

as the test is increased in length? 

_If I quadruple the length of this test would I expect any change in the standard 

deviation of the true scores? 
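
The two items above that call for a numerical answer can be checked with a short Python sketch. It assumes that the standard deviation of a test lengthened n times is s√n·√(nr + 1 − r), and that the validity of a perfectly reliable test is obtained from the standard correction for attenuation in the test alone (validity divided by the square root of the reliability). The function names are hypothetical.

```python
import math

def lengthened_sd(s, r, n):
    """Standard deviation of raw scores when a test with standard deviation s
    and reliability r is made n times as long."""
    return s * math.sqrt(n) * math.sqrt(n * r + 1 - r)

def validity_with_perfect_test(validity, reliability):
    """Validity expected if the test (but not the criterion) were free of error."""
    return validity / math.sqrt(reliability)

# Standard deviation 20, reliability .80, length quadrupled.
print(round(lengthened_sd(20, 0.80, 4), 2))

# Reliability .81, validity .64, test made perfect.
print(round(validity_with_perfect_test(0.64, 0.81), 2))
```

The standard deviation of 30 and the error of measurement of 13+ quoted in the second item are not needed for this estimate; only the reliability and the validity enter the correction.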
Formulas showing the relationship between test length, heterogeneity, reliability, 
and validity: 


$r_{11}$ = reliability coefficient for test of unit length. 
$r_{01}$ = validity coefficient for test of unit length. 

$R_{11}$ = reliability coefficient when altered by increasing either the length of the 
test or the heterogeneity of the group. 

$R_{01}$ = validity coefficient altered by increasing the length of the test. 
$m$, $n$ as either coefficients or subscripts indicate the length of the augmented test. 

$\sigma$ = standard deviation of scores of the test of unit length. 

$\Sigma$ = standard deviation of scores of the test when it is altered either by 
increasing the length of the test or the heterogeneity of the group. 


1. $\dfrac{n r_{11}}{1 + (n - 1) r_{11}}$


2 ^ n ( ri1 ~~ jj 

rn(Rn — 1) 

« /1 ~ rn 

’•Wnns 


5. 



+ rn 


6 . 


l — m 

m 2 

*o7" ni 


7. $\sqrt{R_{mm}R_{nn}}$


*1 


4. $1 - \dfrac{\sigma^2}{\Sigma^2}(1 - r_{11})$


For each item write the number of the one formula that is most clearly suggested. 
Give only one answer except where multiple answers are indicated. 

_This test has a reliability of .81. I should like to raise the reliability to .95. 

_In working with test X, Mr. B reports a reliability of .84 with mean 112, standard 
deviation 85. Mr. C uses the same test and reports that he gets a reliability 
of .95. 


_I have a vocabulary test of 300 items with mean 210, standard deviation 15, 

and reliability .76. Can I estimate its correlation with another similar vocabulary 
test of 500 items, with a reliability of .81, mean 351, and standard deviation 21? 

_These formulas depend upon the assumption that the standard error of measurement 
of a test is invariant with respect to variations in the heterogeneity of 
the group taking the test. (Multiple answer possible.) 

_This intelligence test has a reliability of .80, but its correlation with grades is 

only .50. I wonder if I could make the test so long that its validity would 
increase to .70. 

__A given college entrance examination has a reliability of .80, mean 190, and 

standard deviation 15. The same examination is given next year, and it is 
discovered that the average score is 180 and the standard deviation is 26. 
-On this vocabulary test, I find that the odd-even correlation is .81. 

-This 30-minute test has such a low validity that I wonder if its validity would 

be appreciably changed by making it into a 2-hour test. There would be space 
in the testing program for a test as long as that. 

-The 2-hour final examination that I have been giving for the last two years 

has a distressingly low reliability. Would it be worth while to consider giving 
a 6-hour final of the same type? 

_This formula can be readily derived from the correction for attenuation. 

A test is given to two different groups with these results: 

           Mean     σ      N 
Group A     100     20    200 
Group B     150     10    400 


Mark each of the following items: 

A if it will be larger in group A. 

B if it will be larger in group B. 

S if it will be about the same in both groups. 

O if one cannot tell from the data given. 

N if the statement is nonsense. 

_The reliability coefficient of the test. 

_The standard error of measurement of the test. 

_The average achievement level of the group. 

_The reliability coefficient of the test if it is made four times as long. 

_The correlation between the scores of group A and the scores of group B. 

_The ratio of the standard deviation of odd-item scores to the standard 
deviation of even-item scores. 

_The standard deviation of the difference between odd-item scores and 
even-item scores. 

_The slope of the line of regression of even-item scores on odd-item scores. 

_The true variance. 

_The correlation between true scores and error scores. 
Given a test with a reliability coefficient ($r_{11}$) .84, length ($k$) 100, and standard 
deviation ($\sigma$) 20. 

Estimate each of the following, giving: 

(a) The general formula expressed in symbolic form. 

(b) The numerical answer for the particular case. 

1. Reliability coefficient if the test is altered to length N (N = 250). 


(a) Formula 


(b) Numerical answer 


2. Length necessary if one is content to use a test with a reliability of .72. 


(a) Formula 


(b) Numerical answer 


3. The error variance for length N (N = 250). 


(a) Formula 


(b) Numerical answer 


4. The true score variance for length N (N = 250). 


(a) Formula 


(b) Numerical answer 


5. The obtained standard deviation for length N (N = 250). 


(a) Formula 


(b) Numerical answer 

6. The correlation between the test of length k (100) and length N (250). 


(a) Formula 


(b) Numerical answer 
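
One way to check the six numerical answers is sketched below in Python. The sketch rests on the usual lengthening assumptions (true variance grows with the square of the length ratio, error variance in proportion to it) and, for the sixth item, assumes that the two tests share no items, so that their correlation is the geometric mean of their reliabilities. Variable and function names are hypothetical.

```python
import math

r11, k, sigma, N = 0.84, 100, 20.0, 250
n = N / k  # ratio of the new length to the original length

def spearman_brown(r, n):
    """Reliability of a test made n times as long."""
    return n * r / (1 + (n - 1) * r)

def length_ratio_for(target, r):
    """Length ratio needed to move reliability r to the target reliability."""
    return target * (1 - r) / (r * (1 - target))

reliability_N = spearman_brown(r11, n)            # item 1
items_for_72 = k * length_ratio_for(0.72, r11)    # item 2, in items
error_var_N = n * sigma ** 2 * (1 - r11)          # item 3
true_var_N = n ** 2 * sigma ** 2 * r11            # item 4
sd_N = math.sqrt(true_var_N + error_var_N)        # item 5
corr_k_N = math.sqrt(r11 * reliability_N)         # item 6

for name, value in [("reliability", reliability_N), ("items", items_for_72),
                    ("error variance", error_var_N), ("true variance", true_var_N),
                    ("standard deviation", sd_N), ("correlation", corr_k_N)]:
    print(f"{name}: {value:.3f}")
```

The length in item 2 comes out as a fraction of an item and would be rounded to a whole number of items in practice.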


A class of 200 students is given the L and M forms of the Stanford Binet test. 
Below are given a number of different ways of estimating the reliability coefficient 
of the Stanford Binet. 

In column A mark each of these methods: 

1 if it is the best method of estimating reliability. 

+ if it is a reasonably good method of estimating reliability. 

0 if it is a method that could not give an estimate of reliability. 

In column B mark each of these methods: 

+ if it is necessary to use the Spearman-Brown correction. 

0 if it is not necessary to use this correction. 

A B 


- -Correlation of score on odd items with score on even items on form L. 

__Correlation of score on the first half of the test, with score on the second 

half on form M. 

_ _Correlation of score on form M with score on form L. 

_ _Use of the Kuder-Richardson formula (simplest form using only mean 

standard deviation and number of items) on the items of form L. 

_ _Give form M again and correlate scores on the first giving with those 

on the second for form M. 


A and B are comparable halves of a test, $r_{AB}$ = .60. The standard deviation of 
A is 14, and its mean is 103; corresponding figures for B are 26 and 106, respectively. 
Comment on the foregoing data with special reference to the reliability of the total 
test. 


1 and 2 are comparable halves of a test. $r_{12}$ = .90. The mean and standard deviation 
of part 1 are respectively 147 and 34; the corresponding figures for part 2 are 148 
and 33. Comment on the foregoing data with special reference to the reliability of 
the total test. 
A test of some simple mechanical ability in which practice has no effect is given 
twice to a class of 100 students. The standard deviation of each of the distributions 
is 10.0, the correlation between the two scores is .64. Assume that the distribution 
is normal, homoscedastic, etc. 

1. What is the probability that the score of any given student on the first test will 
deviate by more than 6 score points from his score on the second test? 


(a) Appropriate 
formula 


(b) Numerical value 
of appropriate 
standard error 


(c) Probability 


2. What is the probability that the score of any given student on the first test will 
deviate by more than 6 score points from the prediction made from scores on the 
second test? 


(a) Appropriate 
formula 


(b) Numerical value 
of appropriate 
standard error 


(c) Probability 


3. What is the probability that the score of any given student on the first test will 
deviate by more than 6 score points from his true score? 


(a) Appropriate 
formula 


(b) Numerical value 
of appropriate 
standard error 


(c) Probability 
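
The three parts of this question differ only in which standard error applies. A minimal Python sketch follows, assuming that the correlation of .64 between the two administrations may be taken as the reliability coefficient and that deviations are normally distributed with mean zero; the names are illustrative.

```python
import math

def two_sided_tail(limit, se):
    """P(|deviation| > limit) for a normal deviation with mean 0 and SD se."""
    z = limit / se
    return math.erfc(z / math.sqrt(2))

s, r, limit = 10.0, 0.64, 6.0

# 1. Difference between scores on the two administrations (error of substitution).
se_substitution = s * math.sqrt(2) * math.sqrt(1 - r)
# 2. Difference between a score and its prediction from the other score (error of estimate).
se_estimate = s * math.sqrt(1 - r ** 2)
# 3. Difference between observed score and true score (error of measurement).
se_measurement = s * math.sqrt(1 - r)

p_substitution = two_sided_tail(limit, se_substitution)
p_estimate = two_sided_tail(limit, se_estimate)
p_measurement = two_sided_tail(limit, se_measurement)

for label, se, p in [("substitution", se_substitution, p_substitution),
                     ("estimate", se_estimate, p_estimate),
                     ("measurement", se_measurement, p_measurement)]:
    print(f"{label}: standard error {se:.2f}, probability {p:.3f}")
```

The error of measurement is the smallest of the three, so a deviation of 6 points is least probable in the third case.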













Answers to Problems 


Note: Where discussions or derivations are called for, the answers are not given. 


Chapter 2 


Test   Index of      Standard Deviation   Correlation between   Error of 
       Reliability   of True Scores       Observed and Error    Measurement 
                                          Scores 

A         .95            14.25                  .30                 4.50 
B         .92            23.64                  .40                10.28 
C         .88             9.94                  .47                 5.31 
D         .93            71.14                  .36                27.54 
E         .87            19.05                  .49                10.73 



                        True Score Limits 
                        (Approximately 0.3 per cent level) 

(a) 115 on test A       101.50-128.50 
(b) 211 on test B       180.16-241.84 
(c) 31 on test C         15.07- 46.93 
(d) 500 on test D       417.38-582.62 
(e) 100 on test E        67.81-132.19 
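
The true score limits listed here are consistent with taking each observed score plus and minus three times the corresponding error of measurement from the Chapter 2 table. The Python sketch below reproduces them under that reading; the dictionary and function names are illustrative.

```python
# Errors of measurement and observed scores as given in the Chapter 2 answers.
error_of_measurement = {"A": 4.50, "B": 10.28, "C": 5.31, "D": 27.54, "E": 10.73}
observed_score = {"A": 115, "B": 211, "C": 31, "D": 500, "E": 100}

def true_score_limits(score, sem, k=3):
    """Limits score - k*sem and score + k*sem, rounded to two decimals."""
    return (round(score - k * sem, 2), round(score + k * sem, 2))

for test in "ABCDE":
    low, high = true_score_limits(observed_score[test], error_of_measurement[test])
    print(f"Test {test}: {low:.2f}-{high:.2f}")
```

Three standard errors on either side of the score correspond to the "approximately 0.3 per cent level" of the normal curve.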



             Minimum Difference 

Test      C-2                C-3 

A      (12.726) 13       (19.089) 20 
B      (29.072) 30       (43.608) 44 
C      (15.017) 16       (22.525) 23 
D      (77.883) 78       (116.825) 117 
E      (30.344) 31       (45.517) 46 
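
The figures in parentheses agree with 2 (or 3) times √2 times the error of measurement of each test, with √2 taken as 1.414, and the whole numbers are those figures rounded up to the next score point. The Python sketch below reproduces the table on that assumption; reading the two column headings as the multipliers 2 and 3 is an inference from the numbers, not something stated in the answers.

```python
import math

SQRT2 = 1.414  # three-decimal value; the printed figures appear to use it

error_of_measurement = {"A": 4.50, "B": 10.28, "C": 5.31, "D": 27.54, "E": 10.73}

def minimum_difference(sem, multiple):
    """Return (raw value, smallest whole score difference not below it)."""
    raw = round(multiple * SQRT2 * sem, 3)
    return raw, math.ceil(raw)

for test, sem in error_of_measurement.items():
    print(test, minimum_difference(sem, 2), minimum_difference(sem, 3))
```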


1 . 


1 . 


12 . 


Chapter 3 


A. $T = X - E$


D. $r_{E_1E_2} = 0$


B. $M_E = 0$


E. 


C. r»» 0 


F. $s_1 = s_2 = \cdots = s_k$


Chapter 4 



       $s_d$ 

A.     5.66     5.43     4.00     3.67 
B.     3.74     3.47     2.65     2.24 
C.    20.84    20.47    14.74    14.21 
D.    12.10    11.76     8.56     8.07 
E.    10.08     9.51     7.13     6.30 


(a) $s_e$ = 6.30 (c) 135.9 > $T_A$ > 98.1 

(b) $s_e$ = 6.30 (d) 113.9 > $T_B$ > 76.1 


Chapter 6 


3. Test    Estimated Reliability 

   A           .98 
   B           .84 
   C           .93 
   D           .95 
   E           .91 


Chapter 7 

3. (a) 619.35; (b) 198.69; (c) 3.18; (d) 5.50; (e) 31 items; (f) 300 items; (g) 27 items. 


Chapter 8 

3. $\overline{s^2} = s_1^2$ 

$\overline{r_{ij}s_is_j} = r_{11}s_1^2$ 

where $s_1^2$ is the variance of one unit test, 

$\overline{s^2}$ is the average variance of all unit tests, 
$r_{11}$ is the reliability of one unit test, and 

$\overline{r_{ij}s_is_j}$ is the average covariance of all unit tests. 

4. (a) .97; (b) .96; (c) 3.89 times as long or 78 items; (d) 50 items; (e) 32 items; 
(f) .88; (g) 240 items. 
Chapter 9 


1. (a) n i 5 m and (6) r W80 - 

2. Assumed that (r woo \Zr\ [) is not greater than Vru. 

4. .90. 

5. .64. 

6. (a) Mean 65, standard deviation 13.16, reliability .90, validity .76. 

(b) 34 new items, .54 new validity. 

(c) Test E. 

(d) Test A. 

(e) Test A. 

(f)   A     B     C     D     E 

     .96   .68   .87   .89   .92 

(g) .77. 

(h) k = 3.86 or 4. 

(i) True variance of test C = 100.75 (true variance of test C increased 
to 300 items = 906.75). 

Error variance of test C = 13.74 (error variance of test C increased 
to 300 items = 41.22). 

(j) Reliability of lengthened test = .97. 

Reliability of lengthened criterion = .82. 

Validity of lengthened measures = .79. 

(k) .77. 

7. Test X items. 

Chapter 10 

2. (a) Reliability about .93. (b) Standard deviation about 6.2. (c) Yes. (d) No. 

(e) Time limit decreased. (f) Test E is unsuitable for sectioning a group with standard 
deviation of 3.9. 



Chapter 11 

6. (a) .84; (b) 26.9. 6. (a) .90; (b) 42.7. 7. (a) .60; (b) 320.95. 9. 13.3 
10. (a) .57; (b) 13.3. 


Chapter 12 

S . Ryz ™ *80. Rxz * *77. 


Chapter 14 

Note. To facilitate computational checks, quantities such as $D$, $B$, $s^2$ (or $u$), 
$s^2r$ (or $w$), and $v$ as well as $L$, and $-N \log_{10} L$ are given. 

In answering the question “Are the tests parallel?” the following convention was 
used: 

Yes indicates p-value greater than .05. 

No indicates p-value less than .01. 

? indicates p-value between .05 and .01. 

— indicates that the test for equality of means cannot be made because the 
data are not in agreement with H ve (or ft vc ). 
1. D - 379,2X5.81; s 2 - 194.2733; s*r - 170.7367; v - 1.7034. 

Are the tests 

—AT logio £ dt parallel? 

- .8181 17.44 6 No 

U, - .9408 5.30 4 ? 

1-1V(* - 1) logic £] 

U - .9325 12.14 2 No 

2(a). D = 491.39363; s 2 - 10.4931; « 2 r - 5.620087; v - 14.5956. 

Are the tests 

—N logic £ df parallel? 

L mte m .002435 1568.1 11 No 

£»* - .1552 485.47 8 No 

[-N(fc-l) logic £) 

£* - .2503 1082.77 3 — 

2(6).' No. 

2(c). D - 140.09132; « 2 - 17.574598; s 2 r - 12.829812; v - 0.1168. 

Are the tests 

—N logio £ df parallel? 

Lmvc - -9477 13.998 2 No 

£»« - .9711 7.644 1 No 

Lm - .9760 6.330 1 — 

2(d). D = 8.0254985; s 2 = 3.411593; « 2 r - 1.872942; v = 0.1761. 

Are the tests 

—N logio £ df parallel? 

L mvc *» .8857 31.626 2 No 

£,c - .9870 3.408 1 No 

L m - .8973 28.236 1 — 

3(a). D - 39,750.431; s 2 - 24.47055; s 2 r - 13.0306; v - 0.2604. 

Are the tests 

—N logic £ df parallel? 
£*„ - .3905 61.45 11 No 

£m -.4177 47.77 8 No 


[-N(k - 1) log w £) 
£m - .9778 3.69 


3 
3(b). $D$ = 9239.3439; $s^2$ = 25.337167; $s^2r$ = 11.130033; $v$ = 0.2714. 

Are the tests 

$-N \log_{10} L$ df parallel? 

4.21 6 Yes 

2.13 4 Yes 

l-N(k - 1) log 10 L] 

L m - .9813 2.07 2 Yes 

3(c). $D$ = 552.37637; $s^2$ = 26.63625; $s^2r$ = 12.4363; $v$ = 0.1678. 

                        $-N \log_{10} L$    df    Are the tests parallel? 

$L_{mvc}$ = .98395           0.8845         2       Yes 

$L_{vc}$ = .99558            0.2429         1       Yes 

$L_m$ = .98832               0.6426         1       Yes 

3(d). $D$ = 351.65226; $s^2$ = 22.30485; $s^2r$ = 12.0692; $v$ = 0.00201. 

                        $-N \log_{10} L$    df    Are the tests parallel? 

$L_{mvc}$ = .99927           0.0397         2       Yes 

$L_{vc}$ = .99946            0.02974        1       Yes 

$L_m$ = .99980               0.01134        1       Yes 

3(/). t> - 377,668.24; 6 - 2364.63; «* - 24.2706; w x - 11.0902; v x - 0.3836. 

Are the tests 

—N logio L df parallel? 

Lmvc - .8681 7.740 8 ? 

Lvc - .9194 4.599 6 Yes 

Lm - .9442 3.142 2 ? 

8(g). t> - 465,126.99; - 2408.2694; u z - 25.3372; w* - 11.1300; v z - 0.2714. 

Are the tests 

—AT logio L df parallel? 

Lmvc - .9213 4.486 8 Yes 

Lvc - .9568 2.417 6 Yes 

Lm - .9629 2.069 2 Yes 

4. i) - 1,866,681.57; 5 - 72,588.09; u, - 64.6667; w* - 68.2833; - 1.0. 


t*mvc 

- .4717 

—iV’logio l* 

16.31650 

df 

8 

Are the tests 
parallel? 

No 

Lvc 

- .0311 

9.995 

6 

No 

in 

- .7475 

6.31950 

2 

— 


Lmvc “ .9260 
Lvc - .9618 
Chapter 15 

1. .82. 2. .91. 3. .85. 4. (a) .92; (b) .89. 5. .93. 6. (a) .91; (b) .84; (c) .84. 
8. (a) .89; (b) .96. 9. (a) .87; (b) .84; (c) .84; (d) 5.04; (e) (35.9-66.1), (57.9-88.1); 
(f) .93, .96, .98, .99; (g) about 484 items; (i) .48; (j) .97. 


Chapter 16 

1. (.89), (.85). 2. (.81), (.74). 3. (a) (.91); (b) (.88). 


1 * &v 


Chapter 17 




— (3200) + 1.544 - (1.544) 2 
500 


3.46 


2. Frequencies  372  21  13  11  15   9   9  10   5   5  10   5   2   3   3   2   2   0   1   2 
   Score          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
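
The frequency distribution above can be summarized directly. The Python sketch below computes its mean and standard deviation, dividing by N rather than N − 1 (an assumption about the convention used); the results agree with the values 1.544 and 3.46 that appear in the neighbouring answers for this chapter.

```python
import math

frequencies = [372, 21, 13, 11, 15, 9, 9, 10, 5, 5,
               10, 5, 2, 3, 3, 2, 2, 0, 1, 2]
scores = range(len(frequencies))  # scores 0 through 19

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, scores)) / n
mean_square = sum(f * x * x for f, x in zip(frequencies, scores)) / n
sd = math.sqrt(mean_square - mean ** 2)  # divisor N, not N - 1

print(n, round(mean, 3), round(sd, 2))
```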


3. .97 > R > .82 
.97 > R' > .80 


= 1.54 
s u = 3.46 


4. Lower Bound for Reliability Coefficient 


A .96 .96 

B .00 .71 

C .62 .76 

D .83 .85 

E —3.5 or zero .59 


5.     Error of Measurement     Upper Bound 

A           4.75                   5.19 
B           4.31                   8.74 
C           3.24                   7.09 
D           3.25                   3.29 
E           4.75                   7.19 


Chapter 19 

2.        $c$     $\sigma_c^2$     Lowest Gross Score 
                                   Exceeding $c + 2\sigma_c$ 

(a)       10        5                 15 
(b)       50       25                 61 
(c)        4        3.2                8 
(d)       20       16                 29 
(e)        1        0.9                3 
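
The last column of this table can be recomputed from the first two if they are read as a chance score c and its variance. That reading is an inference from the numbers and is not stated in the answers. A minimal Python sketch under that assumption, with a hypothetical function name, follows; for row (d) it gives 29.

```python
import math

def lowest_score_exceeding(chance_score, chance_variance, k=2):
    """Smallest whole score strictly greater than c + k * sigma_c."""
    threshold = chance_score + k * math.sqrt(chance_variance)
    return math.floor(threshold) + 1

rows = {"a": (10, 5), "b": (50, 25), "c": (4, 3.2), "d": (20, 16), "e": (1, 0.9)}
for label, (c, var) in rows.items():
    print(label, lowest_score_exceeding(c, var))
```

Adding one after taking the floor ensures that a threshold falling exactly on a whole number, as in row (b), is exceeded rather than merely equalled.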
9. a b e 

Set A 5.34 4.11 6.56 

B 6.65 2.24 11.07 

10. (o) t vt - 7.353-A,' + 138.97 
(b) w\ - 7.132B, + 88.21 

11. (a) Mean = 111.5039; standard deviation = 22.1712 
(6) Wi = 1.2658A; - 74.174 

(c) w't - 1.036B< - 67.419 


Chapter 20 

1. .917. 2. .00. 3. X h - ,386Z a - M2X, + .905X, + 315.550. 4. .73. 
5. (a) and (s). 6. X p = ,2X a + .9X, + 97. 7. .721. 17. (o) .57; (6) .63; (c) .69. 


Chapter 21 

4. Delete items 2, 5, 22, 31, 32. 
effect of univariate selection on, 145-156 

formulas showing effect of selection on, 
111, 124,133,137,142,148-153,156 
Correlation, maximized by adjusting test 
lengths (reference), 101 
maximized for two weighted composites, 348-351, 358 
multiple (see Multiple correlation) 
partial, effect of selection on, 147, 155 
formula for, 425 

Pearson product-moment formula, 424 
point-biserial, effect of item difficulty 
on, 393 

formula for, 426 

item parameters using, 377, 378, 382, 
387, 389, 390 

statistical criterion of equality for par¬ 
allel tests, 175, 177, 185, 186-187 
Correlation matrix, definition of, 428 
sum of terms in, 429 
Covariance, formula for, 424 
statistical criteria for equality in paral¬ 
lel tests, 175, 177, 185, 186-187 
Criterion, most predictable, 348-351, 358 
Cutting scores, multiple, compared with 
weighted composite, 312-314 

Determinants, linear equations solved 
with, 118, 422 

multiple correlation coefficient in terms 
of, 329 

regression weights in terms of, 118, 
328-329 

Deviation score, definition of, 424 
mean of, 424 

Difference, standard deviation of, 40, 45, 
199, 216, 425 

Differential tests, 351-355, 359 

Educational age, 286-291 
Educational quotient, 291 

Educational Testing Service, 67, 191 (p), 
349, 395(p) 

Ellipse, equation for, 423 
Equating tests, derivation of formulas 
for, 290-303 

many equating variables, 301-304,307 
regression line for, 297-298, 299-304, 
307 

single equating variable, 299-301, 307 
single group used for, 296-298, 306-307 
statistical criterion for parallel tests, 
297 



Equating tests, two groups used for, 299- 
304, 307 

Equation, circle, 422, 424 
conic section (general), 423 
conic section (standard form), 423 
correlation (biserial), 426 
correlation (multiple), 329, 330 
correlation (partial), 425 
correlation (Pearson product-moment), 
424 

correlation (point-biserial), 426 
covariance, 424 
ellipse, 423 
hyperbola, 422-423 
linear, solution of, 422 
mean, 424 
parabola, 423, 424 
quadratic, solution of, 422 
regression line, 425 
second-degree, 423 
standard deviation, 424 
standard error of estimate, 425 
straight line, 422, 424 
variance, 424 

Error, constant or systematic, 6 
definition of four types, 39, 45 
random, definition of, 6-8, 26 
relative magnitude of various types, 
43-45 

Error of estimate (standard), error of 
measurement related to, 41, 45, 
49, 57 

formula for, 425 

selection (effect on), 132, 142, 147, 155 
true score from observed score, 43, 45 

Error of measurement (squared), as a 
function of test score, 115-124, 
125 

for a mesokurtic distribution of test 
scores, 124 

for a normal distribution of test scores, 
124, 126 

for a symmetrical distribution of test 
scores, 123 

Error of measurement (standard), critical 
score points related to, 264-265, 
304 

doubling test length, effect on, 61-62, 
67 

error of estimate related to, 49, 57 
Error of measurement (standard), error 
of substitution related to, 48-49, 
57 

formula for, 15-17, 26, 34, 37 
interaction related to, 50-54, 57 
item analysis related to, 392 
length of test (general case), 72-73 
norms affected by, 289-290 
standard deviation of a difference re¬ 
lated to, 40-42, 45, 48-49, 57 
time limits, effect on, 233-236, 242 
use illustrated, 17-22 
weighting scores inversely as, 336, 357 
Error of substitution, 40, 45 
Error score, basic assumptions, 4-6, 25 
correlation with observed scores, 23- 
25, 27, 35, 37 

correlation with true scores, 7, 26, 35-36, 37 

four types defined, 39, 45 
mean, 6, 26, 33, 37 

standard deviation, 15-17, 26, 33-34, 
37 

standard deviation of various types, 
compared, 43-45 
variance, 8-11, 26, 33-34, 37 
variance (equation for), 15-17, 26, 34, 
37 

Error variance, comparison of various 
types, 43-45 

parallel tests defined by, 12-13, 26 
reliability, affected by, 114-115 
testing conditions affecting, 108-110 
weighting scores by, 331-334, 356 
Estimate, formula for standard error of, 
425 

standard error of, 41, 43, 45, 49, 57, 
132, 142, 147, 155, 425 
Examination questions, samples for sta¬ 
tistics, 437-446 

samples for test theory, 447-460 
Expansion of binomials, 421 
Expansion of polynomials, 421 

Factor analysis, weighting Scores, 343- 
345, 358-359 
Factorial notation, 421 
Figures illustrating text, attenuation, 
correction for (Ch. 9, Fig. 5), 103 


Figures illustrating text, correlation, first 
with second half for a speed test 
(Ch. 15, Fig. 1), 202 
odd versus even items for a speed 
test (Ch. 15, Fig. 2), 206 
true and fallible measures (Ch. 9, 
Fig. 3), 96 

criterion level and regression of criterion 
on test score (Ch. 19, Fig. 4), 294 

criterion percentage above critical level 
for given test score (Ch. 19, Fig. 5), 295 

cutting scores versus weighted com¬ 
posite (Ch. 20, Fig. 2), 314 
difference between observed and true 
score (Ch. 2, Fig. 4), 18 
difference between two observed scores 
(Ch. 2, Fig. 5), 21 

errors of measurement, prediction and 
substitution (Ch. 4, Fig. 1), 44 
item selection for parallel tests (Ch. 15, 
Fig. 3), 209; (Ch. 15, Fig. 4), 210 
item selection to maximize test validity 
(Ch. 21, Fig. 1), 383; (Ch. 21, Fig. 
2), 384 

norms, affected by regression line (Ch. 
19, Fig. 1), 288 

affected by reliability and regression 
line (Ch. 19, Fig. 2), 289 
affected by selection (Ch. 19, Fig. 3), 
292 

parallel tests, item selection for (Ch. 
15, Fig. 3), 209; (Ch. 15, Fig. 4), 
210 

passing both versus passing either test 
(Ch. 20, Fig. 1), 313 
Pythagorean theorem, computing dia¬ 
gram for (Ch. 2, Fig. 2), 10 
regression line, and critical criterion 
level (Ch. 19, Fig. 4), 294 
effect on norms (Ch. 19, Fig. 1), 288 
reliability, and length of test (Ch. 8, 
Fig. 1), 80; (Ch. 8, Fig. 2), 81; 
(Ch. 8, Fig. 3), 82; (Ch. 9, Fig. 1), 
91 

and selection of cases (Ch. 10, Fig. 
2), 113 

effect on norms (Ch. 19, Fig. 2), 
289 
Figures illustrating text, reliability, index 
of reliability, correlation of observed 
and error scores (Ch. 2, Fig. 6), 24 

of difference score (Ch. 20, Fig. 3), 
354 

true standard deviation, error of 
measurement (Ch. 2, Fig. 3), 16 
selection, effect illustrated (Ch. 10, 
Fig. 1), 109; (Ch. 11, Fig. 1), 129 
effect on norms (Ch. 19, Fig. 3), 292 
effect on reliability (Ch. 10, Fig. 2), 
113 

explicit and incidental compared 
(Ch. 11, Fig. 4), 139 
explicit and validity (Ch. 11, Fig. 3), 
137 

incidental and validity (Ch. 11, Fig. 
2), 134 

standard deviation, effect on correla¬ 
tion illustrated (Ch. 10, Fig. 1), 
109; (Ch. 11, Fig. 1), 129 
observed, true and error (Ch. 2, 
Fig. 2), 10 

validity, and explicit selection (Ch. 11, 
Fig. 3), 137 

and incidental selection (Ch. 11, 
Fig. 2), 134 

and length of test (Ch. 9, Fig. 1), 91; 
(Ch. 9, Fig. 2), 92; (Ch. 9, Fig. 4), 
100 

maximized by item selection (Ch. 21, 
Fig. 1), 383; (Ch. 21, Fig. 2), 384 
variance, observed true and error (Ch. 
2, Fig. 1), 9 

weighted composite versus multiple 
cutting score (Ch. 20, Fig. 2), 314 
weighting components in a sum (Ch. 
18, Fig. 1), 254 

Guessing, correction of item parameters 
for, 371-373 

correction of test score for, 246-251,260 

Harvard University, 283 

Heterogeneity of group (see Selection of 
group) 

Homogeneity of items (see also Reliability) 
compared with reliability coefficient, 
220 


Homogeneity of items, reliability deter¬ 
mined by, 221-224, 226-227 
Hyperbola, equations for, 422-423 

Index of reliability, 22-23, 27, 32-33, 37 
Intelligence Quotient, 291 
Interaction variance, relation to error of 
measurement, 50-54, 57 
Item analysis, comparison of methods, 
364, 373-375, 380 

mean of test controlled by, 365-367, 
389 

mean of “unattempted” scores derived 
from, 238, 242-243 
psychophysics related to, 392 
purpose of, 363, 365, 374 
references on, 363-364 
variance of “unattempted” score de¬ 
rived from, 239-240, 242-243 
Item construction, references on, 2 
Item difficulty, derivation of formula re¬ 
lating mean to, 366 
formula relating test mean to, 367, 389 
Horst’s correction for guessing, 371- 
373 

parameters designed to reduce costs, 
373-374, 385, 394 

parameters invariant with changes in 
group, 367-371 

parameters to correct for guessing, 
371-373 

parameters utilizing item-criterion 
curve, 370 

parameters utilizing normal curve, 368-369 

parameters utilizing regression line, 369-371 

reliability as a function of, 221-224, 
226, 378, 389 

Thurstone’s method of calibration, 
367-368 

validity of test related to, 374-375 
Item difficulty index, computing proce¬ 
dures, 385 

formula relating test reliability to, 378, 
389 

Item homogeneity (see Homogeneity of 
items, and Reliability) 

Item parameters, classification of, 380 
mean of test related to, 365-367, 389 
Item parameters, proportion answering
correctly, 425 

reliability index and test standard de¬ 
viation, 377, 389 

reliability index defined, 377, 389 
reliability of test related to, 378-380, 
389 

standard deviation of test related to, 
375-378, 389 

unattempted items affect, 385-386 
validity of test related to, 380-385, 389 
variance of item, 425 
variance of test related to, 375-378, 
389 

Item reliability index, computing formula 
for, 387-388, 390 
definition, 377, 389 

reliability of test related to, 378-380, 
389 

standard deviation of test related to, 
375-377, 389 

validity of test related to, 382, 389 

Item selection, judgments of experts, 365 
parallel tests obtained by, 208-210 

Item selection theory, problems in, 391- 
394 

summary of, 388-391 

Item validity index, computing formula 
for, 387-388, 390 
definition, 382, 389 

invariant with item selection, 383, 384 
validity of test related to, 380-385, 
389 

Kurtosis, effect on error of measurement, 
124, 126 

equating tests affected by, 296, 307 

Lagrange multipliers, 347, 348-349 

Least squares fit for second degree poly¬ 
nomial, 117-118 

Length of test, correlation maximized by 
adjusting relative (reference), 101
effect of doubling on error of measure¬ 
ment, 61-62, 67 

effect of doubling on mean, 59-60, 67 
effect of doubling on reliability, 62-65, 
67 

effect of doubling on true variance, 61, 
62, 67



Length of test, effect of doubling on variance
of gross scores, 60-61, 67
effect on error of measurement (general 
case), 72-73 

effect on mean (general case), 69, 73 
effect on reliability (general case), 77- 
79, 86 

effect on true variance (general case), 
71, 73 

effect on validity (general case), 83-90, 
98-101, 104 

effect on variance of gross scores (general
case), 69-71, 73

experiments showing effect on relia¬ 
bility, 65-67 

for a specified reliability, 82-83, 86 
for a specified validity, 93-94, 104 
function of reliability invariant with 
changes in, 83-85, 86, 94, 101, 105
function of validity invariant with 
changes in, 94, 101, 105 
graphs showing effect on reliability,
79-82 

graphs showing effect on validity (gen¬ 
eral case), 91, 92, 100 
graphs showing effect on validity of 
infinite, 96, 103 

Line, equation for straight, 422, 424 
Linear equations, solution of, 422 

Marginal performance, cutting score de¬ 
termined from, 293-296 
Matrices, elementary rules for manipu¬ 
lating, 159-160 

Lagrange multipliers used in solving 
equations, 347, 348-349 
most predictable criterion derived 
with, 348-351 

multiple correlation derived with, 162- 
164 

multivariate selection formulas derived 
with, 165-171 

weights maximizing reliability derived 
with, 346-348 

Mean, for deviation scores, 424 
for error scores, 6, 26, 33, 37 
for true scores, 8, 26, 29, 37 
of observed scores, derivation of for¬ 
mula relating item difficulty to, 
366 
Mean, of observed scores, effect of doubling
test length on, 59-60, 67

effect of test length on (general case), 
69, 73

formula relating item difficulty to, 
367, 389

item parameters related to, 365-367, 
389 

statistical criteria for equality in 
parallel tests, 175, 179, 185, 187- 
188 

weighting scores by the, 336-338, 356

Mental age, 286-291 

Moments, third and fourth, for half test 
as functions of total test moments, 
121-122 

Multiple correlation, derivation of for¬ 
mula for, 162-165, 327-330 
error of estimate derived, 164-165 
general formula for, 329 
graphic approximation for, 331 
scoring formula using, 255-256 
three-variable formula for, 330 
weighting by approximations to, 330- 
331, 356

Multiple regression weights, derivation 
of formula for, 162-164, 327-330 
general formula for, 328-329 
standard deviations as weights, a spe¬ 
cial case of, 334-336 
three-variable formula for, 329 
undesirability of negative, 330 
weighting by reliabilities as a special 
case, 331-334 

Multiple trials, effect on standards, 265, 
304, 312-313

National Society for the Study of Edu¬ 
cation, 2, 411 

Normal curve, formula for ordinate, 426
table of, 431-435 

Parabola, equation of, 423, 424 

Parallel tests, defined in terms of ob¬ 
served scores, 28-29, 36 
defined in terms of true score and error 
variance, 11-13, 26
experimental procedure for obtaining, 
208-210 

qualitative criteria for, 173-174, 194


Parallel tests, reliability defined by, 13- 
14, 194-197, 214-215 
statistical criterion, basic statistics for, 
174 

for equality of means (L_m), 179-180
for equality of means, of variances,
of covariances (L_mvc), 174-176
for equality of variances and covariances
(L_vc), 176-179
illustrative problem, 180-181 
purpose of, 173 

table of 5 per cent and 1 per cent 
points for, 180, 189 

Partial correlation, effect of selection on, 
147, 155 

formula for, 147, 425 
Point-biserial coefficient, effect of item 
difficulty on, 393 
formula for, 426 

item parameters using, 377, 378, 382, 
387, 389, 390 

Polynomials, expansion of, 421 
Power test, definition of, 230-233, 241 
effect of guessing on score, 246-251 
Princeton University, 173 
Problems, answers for, 461-467 
Product moments of half tests as func¬ 
tions of total test moments, 119- 
121 

Profiles, reliability of, 353 
Psychological Corporation, 354 
Psychophysics, item analysis related to, 
392 

relation to test theory, 392 

Quadratic equation, solution of, 422 

Random error, definition of, 6-8, 26 
Range, restriction of (see Selection of 
group) 

Regression, multiple (see Multiple correla¬ 
tion, Multiple regression weights) 
Regression line, cutting score determined 
from, 291-293, 306 
effect on age-grade norms, 287-288
effect of selection on slope of, 131, 142,
146, 155 

equating tests by, 297-298, 299-304,
307

equation for, 425 
Regression line, formula to determine
cutting score on, 293-296 
Reliability, age-grade norms affected by, 
289-290 

analysis of variance related to, 221 
comparison of different methods of ob¬ 
taining, 207, 215 

correction for reader errors, 211-214, 
216 

defined as correlation between parallel 
tests, 13-14 

derivation of formula for content, 212- 
214 

double length test effect on, 62-65, 67 
error variance effect on, 114 
essay examinations, 211-214, 216 
experimental work showing effect of 
length on, 65-67 
formula for content, 214, 216 
formula for weighting to maximize, 
347, 357

function of item difficulty and test 
variance, 221-224, 226 
function of test mean and variance, 
225-227 

function of the variance of a difference, 
199 

function of variance of half-test scores, 
199 

graphs illustrating effect of length on, 
79-82 

heterogeneity of group, effect on, 110-
114, 124 

index of, 22-23, 27, 32-33, 37 
instability of a trait, 197 
invariant function of (with changes in 
group variability), 140-141, 143 
invariant function of (with changes in 
test length), 83-85, 86, 94, 101, 
105 

item homogeneity measures compared 
with, 220 

item parameters related to, 378-380, 
389 

judgment of test constructor measured 
by, 220 

Kuder-Richardson formulas for, 223-
224, 225-226 

length of test effect (general case), 77- 
79, 86


Reliability, length of test necessary for 
specified, 82-83, 86 
matched random subtests, 207-210,
215

odd-even, affected by time limits, 236- 
238, 242 

odd versus even items, 205-207, 215-
216

parallel tests used for computing, 194- 
197, 214-215 

parallel thirds, 207-209, 216 
reader, 211-213, 216

selection effects on, 110-114, 124 
several subtests recommended, 201 
single common factor related to, 220- 
221 

Spearman-Brown formula, 63, 67, 78,
86

speeded tests, 201-203, 205-207, 215 
speeded test, lower bound for, 236-238, 
242 

split-half formulas, 198-201, 216 
split-half formulas compared, 200-201, 
216 

split-half methods, 198-210, 215 
statistical criteria for equality in paral¬ 
lel tests, 175, 177, 185, 186-187 
successive halves, 201-205 
test-retest, 197-198, 215 
testing conditions effect on, 108-110 
time limits as affecting, 201-203, 205- 
207, 215 

true variance effect on, 110-114 
used in weighting formula, 331-334, 
356 

weighting to maximize, 346-348, 349- 
350, 357 

Reliability index, item selection affects, 
379-380, 384 

of an item, 377, 378, 379, 382, 383, 384,
385, 387, 388, 389, 390 
Reliability of difference scores, 352-354, 
359 

computing diagram for, 354 
Restriction of range (see Selection of 
group) 

Score matrix, definition of, 427 
sum of terms in, 427-428 
Score transformations, purposes of, 266-
268 

types of, 267 

Scores, absolute, 283-284, 284-286, 306 
arbitrary linear transformations of, 
265-266, 304
chance, mean of, 263, 304 
variance of, 263, 304 
criterion predicted from, 291-293, 
306 

critical points and error of measure¬ 
ment, 264-265, 304 
error, 4-6, 25, 28, 36 
graphic transformation to arbitrary 
scale, 265-266 
gross or raw, 424 
linear derived, 272-276, 305 
computing procedures, 274-275 
definition of, 272-274 
properties of, 276 
various types of, 272-273 
McCall’s T-score, 282-283, 306 
non-chance, 263-265, 304 
normalized, 280-282, 305 
computing procedures, 282
definition of, 280-281 
properties of, 280-282 
per cent of perfect, 264-265 
percentile, 276-280, 305 
computing procedures, 277-279 
correlation of, 280 
definition of, 276-277 
properties of, 279-280 
Scaled Scores of Cooperative Tests, 
283-284, 306

selection effect on cutting, 292 
standard, 268-272, 305
computing procedures, 270-272 
definition of, 268 
properties of, 272 
time, 252-255, 261 
true, 4-6, 25, 28, 36

Scoring formula, function of number cor¬ 
rect and errors, 249, 260 
function of number correct and num¬ 
ber blank, 248, 260 
function of number correct and num¬ 
ber unattempted, 250, 260 
maximize item-criterion correlation, 
257, 261 


Scoring formula, mean criterion differ¬ 
ences, 256-257, 261 
multiple correlation for, 255-256 
number correct, 246, 259 
“rank-order” items, 258-259, 261 
t-test used in, 256-257 
time and error scores, 253-254, 261 
weighting rights and errors, 255-256 
Selection of group, computing diagram, 
for explicit, 137 
for incidental, 134 
for relative effect on variance of ex¬ 
plicit and incidental, 139 
correlation between incidental and ex¬ 
plicit selection variables, variances 
known for explicit, 137-138, 142, 
148 

variances known for incidental, 133, 
142, 151-152, 156 

correlation between two incidental se¬ 
lection variables, variances known 
for explicit, 149-150, 156 
variances known for incidental, 153, 
156 

effects of, illustrated, 109, 128-130, 
135-136, 145-146 

explicit, definition of, 130-131, 141 
effects of, 135-138, 148-150 
formulas for, 136-137, 142, 148-150, 
156 

multivariate case, 165-166, 170 
incidental, definition of, 130-131, 141 
effects of, 132-135, 150-155 
formulas for, 133, 142, 150-153, 156 
multivariate case, 166-170, 171 
invariant function of reliability and 
validity, for explicit, 141, 143 
for incidental, 140, 143 
item difficulty parameters related to, 
367-371, 392-393 

multivariate, basic assumptions for, 
162, 170 

basic definitions for, 158-159, 161 
effect on correlation, 165-170, 171 
effect on variance, 166, 168, 169, 
171

practical importance of corrections for, 
145-146 

relative effect on variance of explicit 
and incidental, 138-140, 142 
Selection of group, reliability affected by,
110-114, 124 

univariate, basic assumptions for three- 
variable case, 146-148, 155 
basic assumptions for two-variable 
case, 131-132, 141-142 
effect on correlation between incidental
and explicit selection variables,
133, 137-138, 142, 148, 151-152, 156

effect on correlation between incidental
selection variables, 149-150, 153, 156

effect on variance (standard deviation),
110, 124, 135, 138, 142, 148, 151, 156

variance (standard deviation) of ex¬ 
plicit selection variable, a function 
of incidental selection variance, 
135, 142, 151 

variance (standard deviation) of inci¬ 
dental selection variable, a func¬ 
tion of explicit selection variance, 
138, 142, 148 

a function of incidental selection
variance of a second variable, 151-152,
156

Selection of items, problems in theory of, 
391-394 

reliability index affected by, 379-380, 
384 

theory summarized, 388-391 
validity index unaffected by, 383, 384 
Skewness, effect on error of measurement, 
123 

equating of tests affected by, 296, 307 
Social Science Research Council, 392, 414
Spearman-Brown formula, experimental 
work on, 65-67 
for double length, 63, 67 
general case, 77-79, 86 
graphs illustrating, 79-82 
Spearman’s correction for attenuation, 
101-104, 105 

Speeded test, correction of score for wrong 
answers, 251-252, 260 
definition of, 230-233, 241 
effect of guessing on score, 246-251 
error of measurement for, 233-236, 242
item parameters affected by, 385-386 


Speeded test, odd-even reliability for, 
236-238, 242 

variance change in, 231-233, 241-242 
Standard deviation (see also Variance) 
Standard deviation of a difference, 40, 45,
199, 216, 425 

Standard deviation of error scores, basic 
formulas, 15-17, 26, 33-34, 37 
effect of doubling length on, 61-62, 67 
effect of length on (general case), 72-73 
Standard deviation of errors of estimate,
425 

Standard deviation of a sum, 70, 76, 425 
Standard deviation of test, effect of dou¬ 
bling length on, 60-61, 67 
effect of length on (general case), 69- 
71, 73 

effect of time limits on, 232-233, 241- 
242 

effect on reliability of changes in, 110- 
114, 124 

formulas showing effect of selection on, 
110, 124, 135, 138, 142, 148, 151, 
156 

item parameters related to, 375-378, 
389 

weighting scores by reciprocal of, 334- 
336, 356 

Standard deviation of true scores, basic 
formulas, 8-11, 14-15, 26, 30-32, 
34, 37 

effect of doubling length on, 61, 62, 67
effect of length on (general case), 71, 
73 

Standardizing, error of measurement, ef¬ 
fect on norms, 289-290 
kurtosis, effect on norms, 296, 307 
multiple trials, effect on per cent pass¬ 
ing, 265, 304, 313 

per cent of perfect score used for, 264- 
265 

regression line used for, 291-296, 297- 
298, 299-304, 306, 307 
regression line used influences norms, 
287-288, 306 

reliability, effect on norms, 289-290, 
306 

skewness, effect on norms, 296, 307 
successive hurdles, effect on per cent 
passing, 265, 304, 313 
Statistics, sample examination items,
437-446 

Successive hurdles, effect on standards, 
265, 304, 313 

Summation, notation for correlation ma¬ 
trix, 429 

notation for score matrix, 427 
of terms in correlation matrix, 429 
of terms in score matrix, 427-428 
Summation sign, rules for, 426-427 
use of, 426-429 

Sum, correlation of fixed test with a, 88- 
89

Sums, correlation of two, 74-77, 85 
correlation of weighted, 316-319, 355 
standard deviation of, 70, 76, 425 
variance of, 75-76 

Test theory, sample examination items, 
447-460 

Time limits, error of measurement af¬ 
fected by, 233-236, 242 
reliability affected by, 236-238, 242 
variance (standard deviation) affected 
by, 232-233, 241-242 
True scores, correlation between, 101- 
104, 105 

correlation with error scores, 7, 26, 35-
36, 37

correlation with observed scores, 22- 
23, 27, 32-33, 37 
defined as a limit, 28, 36 
defined as remainder, 5, 8, 25 
estimation of differences, 20-22
estimation of limits, 17-20 
general considerations, 4-6, 25 
mean of, 8, 26, 29, 37 
standard deviation of, effect on relia¬ 
bility, 110-114 

equation for, 14-15, 26, 30-32, 37 
illustrations of changes in, 108-109 
relation to error of measurement, 8- 
11, 26, 34, 37 

used in definition of parallel tests, 11, 
26 

variance, effect on reliability, 110-114 
equation for, 14-15, 26, 30-32, 37 
illustrations of changes in, 108-109
relation to error variance, 8-11, 26,
34, 37


Unattempted items, mean and variance 
given by item analysis, 238-240, 
242-243 

odd-even reliability affected by, 236- 
238, 242 

variance affected by, 231-233, 241-242 
United States Army, 267, 273 
United States Army Air Forces, 267, 418 
United States Navy, 209, 265, 267, 273, 
304 

United States War Department, 418 
University of Chicago, The, 2, 40(n), 
204, 218(p), 229(p), 272, 286, 418 

Validity, effect of explicit selection on, 
137-138, 142, 148-150, 156 
effect of incidental selection on, 133, 
142, 151-153, 156 

effect of infinite length on, 95-98, 101- 
104, 105 

effect of multivariate selection on, 158- 
171 

effect of test length on (general for¬ 
mula), 88-90, 98-101, 104 
effect of univariate selection on, 145- 
156 

for true scores, 95-98, 101-104, 105 
function of, invariant with length of 
test, 94, 105 

invariant with selection of group, 
140-141, 143

graphs showing effect of infinite length 
on, 96, 103

graphs showing effect of test length 
(general case), 91, 92, 100 
illustrations of effect of selection pro¬ 
cedures on, 128-130, 135-136, 
145-146 

item difficulty related to, 374-375 
item parameters related to, 380-385,
389 

length necessary to attain a specified, 
90, 91, 93-94, 104 

maximized by graphic method, 382- 
384 

of early tests, 1 

statistical criteria for equality in parallel
tests, 185, 186-187
Validity index of an item, computing
formula for, 387-388, 390
Validity index of an item, definition, 382,
389 

invariant with item selection, 383, 384 
relation to test validity, 380-385, 389
Variability, quotidian and reliability, 197 
Variability of group, effect of (see Selec¬ 
tion of group) 

Variance (see also Standard deviation) 
Variance due to interaction between per¬ 
sons and tests, 50-54, 57 
Variance of a difference, 40, 45, 199, 216,
425 

Variance of a sum, 70, 75-76, 425 
Variance of an item, 425 
Variance of error scores, effect of dou¬ 
bling length on, 61-62, 67 
effect of test length on (general case), 
72-73 

equation for, 15-17, 26, 33-34, 37 
relation to error of estimate, 49, 57 
relation to error of substitution, 48-49 
relation to interaction, 50-54, 57 
relation to true variance, 8-11, 26, 34-
35, 37

Variance of test, effect of doubling length 
on, 60-61, 67 

effect of multivariate selection on, 166, 
168, 169, 171 

effect of test length on (general case), 
69-71, 73

effect of time limits on, 232-233, 241- 
242 

effect of univariate selection on, 110, 
124, 135, 138, 142, 148, 151, 156 
effect on reliability, 110-114, 124 
effect on validity, 128-143 
equations for, 424 

item parameters related to, 375-378,
389 

relative effect of explicit and incidental 
selection on, 138-140, 142 
statistical criteria for equality in parallel
tests, 175, 177, 185, 186-187
Variance of true scores, effect of doubling 
length on, 61, 62, 67 
effect of test length on (general case), 
71, 73 

equation for, 14-15, 26, 30-32, 37 
relation to error variance, 8-11, 26, 
34-35, 37


Variance of true scores, relation to vari¬ 
ance due to persons, 54-57 

Weighted composites, correlation between
(general case), 316-321, 355

correlation between maximized, 348- 
351, 358 

correlation for random positive weights, 
321-327, 356 

derivation of correlation for random 
positive weights, 321-326 
derivation of correlation (general case), 
316-319 

derivation of weights for maximum 
correlation, 348-349 
formula for correlation between (gen¬ 
eral case), 319, 355 

formula for correlation, random posi¬ 
tive weights, 326, 356 
formula for subtest effect on, 339-340, 
357 

multiple cutting score compared with, 
312-314 

reliability maximized, 346-348, 349- 
350, 357 

subtest effect on, 338-341, 357 
subtest effect on, illustrated from basic 
engineering school data, 340-341 
to predict criterion, 327-330 
Weighting, by use of indifference func¬ 
tion, 254 

of rights and wrongs, 255-256 
of time and error scores, 254 
Weighting coefficients, determined by ex¬ 
pert judgment, 254-255, 341-342, 
357 

determined from errors of measure¬ 
ment, 336, 357 

determined from number of items, 
336-338, 356 

determined from perfect scores, 336- 
338, 356 

determined from reliability coefficients, 
331-334, 356 

determined from standard deviations, 
334-336, 356 

determined from test means, 336-338,
356
general considerations, 314-315, 355




Weighting coefficients, multiple (see Multiple
regression weights)
multiple cutting scores compared with, 
312-314 

random positive, 321-327, 356 
variables characterizing, 314-315, 319-
321, 326, 355-356 

Weighting of item alternatives, to maxi¬ 
mize item-criterion correlation, 
257, 261 

Weighting of items, to maximize criterion 
correlation, 327-330 

Weighting of scores, factor analysis used 
for, 343-345, 358-359 
to equalize marginal contribution to 
variance, 345-346, 357 



Weighting of scores, to give common fac¬ 
tor, 344-345, 358-359 
to give first centroid axis, 344, 359 
to give first principal axis, 343-344, 358
to maximize reliability, 346-348, 349- 
350, 357 

to maximize validity (see Multiple cor¬ 
relation) 

to maximize variance of composite, 
343-344, 359 

to minimize generalized variance, 343- 
344, 359 

to minimize intra-individual variance, 
343-344, 359 

Weighting of scores and criteria for maximum
intercorrelation, 348-351, 358